PRELIMINARY AMENDMENT

Assistant Commissioner for Patents 
Washington, D.C. 20231
Sir: 

Preliminary to examination please amend the above identified application 
as follows: 



IN THE SPECIFICATION 

On page 32, line 9, change "can know" to --to know--.

On page 35, line 7, change "into which clusters and see the degrees" to
--into which other clusters and the degrees--.

IN THE CLAIMS 

Please amend Claims 3-11 and 13 as follows: 



3. (Amended) A document categorizing method according to Claim 1, 
wherein said cluster merging process is performed such that [in what] the 
manner in which feature elements, which characterize respective clusters under 
consideration as to whether they should be merged or not, appear in [the] said 
respective clusters under consideration is examined, and cluster merging is 
performed in accordance with the manner in which the feature elements appear.

4. (Amended) A document categorizing method according to [one of Claims]
Claim 1 [to 3], wherein said cluster merging process is performed on at least [for] 
two clusters, and after completion of [the] said cluster merging process a first 
time, [the] said cluster merging process is repeatedly performed [repeatedly for] 
on the resultant set of clusters until no further cluster merging occurs. 

5. (Amended) A document categorizing method according to [one of Claims] 
Claim 1 [to 4], wherein after completion of said cluster merging process, 
supplementary information indicating that cluster merging has been performed
and also indicating the basis on which the cluster merging has been performed is 
output. 

6. (Amended) A document categorizing method for categorizing a plurality of 
documents into a plurality of clusters according to semantic similarity, said 
method being characterized in that: 

after categorizing said plurality of documents into a plurality of clusters 
according to semantic similarity, a cluster merging process is performed such 
that relations among clusters of said plurality of clusters are evaluated on the 
basis of documents included in the respective clusters, and two or more clusters 
having a degree of relation equal to or higher than a predetermined value are 
combined together; and 

information representing which clusters have [bee] been merged together 
and also representing the degrees of relation among the merged clusters is 
generated, and said information is output together with the categorization result
to be presented to a user so that when final clusters obtained as a result of said 
cluster merging process are displayed, the user can see in what manner said 
cluster merging process has been performed to obtain said final clusters 
[cluster]. 

7. (Amended) A document categorizing method according to Claim 6, 
wherein said generated information, which is output so as to enable the user to
see in what manner said cluster merging process has been performed, is
presented [given] by modifying the manner of displaying the cluster names of
respective clusters merged together in accordance with the degree of relation
among said clusters merged together, said cluster names being displayed in such
a manner that when said degree of relation among said clusters is higher than a 
predetermined value, said cluster names are displayed in an AND form, 
[however] and when said degree of relation among said clusters is lower than 
[the] said predetermined value, said cluster names are displayed in an OR form. 

8. (Amended) A document categorizing method according to Claim 7, 
wherein when said cluster names are displayed in [the] said AND form, said 
cluster names of the respective clusters are displayed successively in a single 
horizontal line or [the respective cluster names] are displayed in different lines, 
and [while] when said cluster names are displayed in [the] said OR form, a 
delimiter is inserted between adjacent cluster names of the respective clusters. 

9. (Amended) A document categorizing method according to Claim 7 [or 8],
wherein when a [certain] first cluster includes a second cluster therein, the 
name of said second cluster included in said [certain] first cluster is enclosed 
within brackets and placed after the name of said [certain] first cluster. 

10. (Amended) A document categorizing apparatus for categorizing a plurality 
of documents into a plurality of clusters according to semantic similarity, said 
apparatus comprising: 

a clustering unit for categorizing a plurality of documents into a plurality
of clusters in accordance with semantic similarity; and 

a cluster merging unit which evaluates the relation among the plurality of 
clusters created by [the] said clustering unit on the basis of the documents 
included in the respective clusters and then combines two or more clusters 
having a degree of relation equal to or higher than a predetermined value. 

11. (Amended) A document categorizing apparatus for categorizing a plurality 
of documents into a plurality of clusters according to semantic similarity, said 
apparatus comprising:

a clustering unit for categorizing a plurality of documents into a plurality
of clusters in accordance with semantic similarity, 

a cluster merging unit which evaluates the relation among the plurality of 
clusters created by [the] said clustering unit on the basis of the documents 
included in the respective clusters and then combines two or more clusters 
having a degree of relation equal to or higher than a predetermined value; 

a cluster-merging-process information generator for generating cluster- 
merging-process information representing which clusters have been merged 
together and also representing the degrees of relation among the merged 
clusters wherein said cluster-merging-process information is to be displayed 
when final clusters obtained via said cluster merging process performed by said 
cluster merging unit are displayed so that a user can see in what manner said 
cluster merging process has been performed to obtain said final cluster; and 

categorization result outputting means for outputting said cluster- 
merging-process information such that said cluster-merging-process information 
is included in the categorization result to be presented to said user. 

13. (Amended) A storage medium on which a document categorizing program 
for categorizing a plurality of documents into a plurality of clusters according to 
semantic similarity is stored, said document categorizing program comprising: 

a clustering step for categorizing a plurality of documents into a plurality 
of clusters in accordance with semantic similarity; 

a cluster merging step in which the degrees of relation among clusters of 
said plurality of clusters obtained in said clustering step are evaluated on the 
basis of documents included in the respective clusters, and two or more clusters 
having a degree of relation equal to or higher than a predetermined value are 
combined together; 

a cluster-merging-process information generating step for generating 
cluster-merging-process information representing which clusters have been 
merged together and also representing the degrees of relation among the merged 
clusters wherein said cluster-merging-process information is to be displayed 
when final clusters obtained via said cluster merging process performed by said
cluster merging step [unit] are displayed so that a user can see in what manner
said cluster merging process has been performed to obtain said final cluster; and 

a step for outputting said cluster-merging-process information such that 
said cluster-merging-process information is included in the categorization result 
to be presented to said user. 



Claims 1-13 are presented for examination. Claims 3-11 and 13 are 
amended to correct typographical and grammatical errors, and to remove 
multiple claim dependencies. The specification is amended to correct 
grammatical errors. No new matter is added by these changes. 

Favorable consideration is respectfully requested. 



Please address all correspondence to: 

Epson Research and Development, Inc. 
Intellectual Property Department 
150 River Oaks Parkway, Suite 225 
San Jose, CA 95134 
Phone: (408) 952-6000 
Fax: (408) 954-9058 
Customer No. 20178 
Date: February 2, 2001

DOCUMENT CATEGORIZING METHOD, DOCUMENT CATEGORIZING
APPARATUS, AND STORAGE MEDIUM ON WHICH A DOCUMENT 
CATEGORIZATION PROGRAM IS STORED 

Inventors: Michihiro Nagaishi
           Shinji Miwa

BACKGROUND OF THE INVENTION 

Background Art 

The present invention relates to a document categorizing method, a 
document categorizing apparatus, and a storage medium including a document
categorizing program stored thereon, for categorizing a large number of documents 
into a plurality of clusters according to semantic similarity. 

In the art of categorizing a large number of documents into a plurality of 
clusters according to semantic similarity, it is known to extract feature elements
from the respective documents and categorize the documents according to the
extracted feature elements. One specific categorization method is to employ the
whole of each document (the entire content of each document including a title and a
body) as a target and extract feature elements from the entire contents of the
respective documents. After that, the documents are categorized into a plurality of
clusters on the basis of the extracted feature elements.

In the case where feature elements are extracted from the entire contents of 
the respective documents, very complicated processing is needed in morphological 
analysis and extraction of feature elements. Thus, when such processing is 
performed by a central processing unit (CPU) in an information processing
apparatus, a large load is imposed upon the CPU. Documents generally include
many expressions having no direct relationships with the purport thereof.
Therefore, if documents are categorized according to feature elements extracted by
searching the entire contents of the respective documents, the resultant
categorization is often meaningless. That is, a large number of noise clusters are
created.

One technique to solve the above problem is to first extract a title 
representing the purport of a document, then extract a feature element from the 
title, and finally categorize the document on the basis of the extracted feature 
element. It is thought that this technique allows documents to be correctly categorized
according to the purport of the respective documents. 

As described above, several techniques of categorizing documents into 
clusters are known. 

However, even when documents are categorized into clusters on the basis of
feature elements extracted from the titles of the documents, the number of
resultant clusters often becomes too great for a user to make use of the resultant
information. For example, when a large number of clusters obtained as a result of
categorization are compared with one another, many of the same documents may be
included in different clusters. In such a case, a user has to search the large number
of presented clusters to find desired information. This is very inconvenient for users.

In view of the above, it is an object of the present invention to provide a 
technique to recategorize a large number of categorized clusters into a simplified 
easily understandable form by means of merging clusters.

Brief Description of the Drawings

Fig. 1 is a block diagram illustrating a first embodiment of the present 
invention. 

Fig. 2 shows examples of documents for illustration of the first embodiment of 
the present invention. 

Fig. 3 is a flow chart illustrating the outline of a document categorization
process according to the first embodiment of the present invention.

Fig. 4 illustrates an example of the content of a feature table representing the 
relationship between feature elements and documents. 

Fig. 5 illustrates a result obtained by categorizing the documents on the basis 
of the feature table shown in Fig. 4.

Fig. 6 illustrates a process of merging two clusters, wherein examples of 
documents included in the respective clusters are shown. 

Fig. 7 illustrates a result obtained by performing a cluster merging process 
upon the categorization result shown in Fig. 5. 

Fig. 8 is a block diagram of a document categorizing apparatus which performs
a cluster merging process in accordance with the manner in which feature elements
appear in original documents.

Fig. 9 is a block diagram illustrating a second embodiment of the present
invention. 

Fig. 10 shows examples of documents for illustration of the second 
embodiment of the present invention. 

Fig. 11 is a flow chart illustrating the outline of a document categorization
process according to the second embodiment of the present invention.

Fig. 12 illustrates an example of the content of a feature table representing 
the relationship between feature elements and documents. 

Fig. 13 illustrates a result obtained by categorizing the documents on the 
basis of the feature table shown in Fig. 12.

Fig. 14 illustrates a process of merging two clusters, wherein examples of 
documents included in the respective clusters are shown. 

Fig. 15 illustrates a result obtained by performing a cluster merging process 
upon the categorization result shown in Fig. 13. 

Fig. 16 shows an example of a categorization result which is displayed such
that the cluster names of clusters which have been merged into a final cluster are
represented in an AND form (that is, the respective cluster names are placed in a
single horizontal line).

Fig. 17 shows an example of a categorization result which is displayed such 
that the cluster names of clusters which have been merged into a final cluster are
represented in another AND form (that is, the respective cluster names are placed 
in different lines). 

Disclosure of the Invention 

To achieve the object described above, the present invention provides a 
document categorizing method for categorizing a plurality of documents into a
plurality of clusters according to semantic similarity, the method being
characterized in that after categorizing the plurality of documents into a plurality of
clusters according to semantic similarity, a cluster merging process is
performed such that relations among clusters of the plurality of clusters are
evaluated on the basis of documents included in the respective clusters, and two or
more clusters having a degree of relation equal to or higher than a predetermined 
value are combined together. 

Preferably, the cluster merging process is performed such that the evaluation 
of relations among clusters under consideration as to whether they should be
merged or not is performed on the basis of the number of documents commonly
included in the clusters under consideration relative to the total number of 
documents included in the clusters under consideration, and cluster merging is 
performed in accordance with the evaluation result. 

Alternatively, the cluster merging process may be performed such that the
manner in which feature elements, which characterize respective clusters under
consideration as to whether they should be merged or not, appear in the respective
clusters under consideration is examined, and cluster merging is performed in
accordance with the manner in which the feature elements appear.

Preferably, the cluster merging process is performed on at least two clusters,
and after completion of the cluster merging process a first time, the cluster merging
process is repeatedly performed on the resultant set of clusters until no further
cluster merging occurs.

Preferably, after completion of the cluster merging process, supplementary 
information indicating that cluster merging has been performed and also indicating
the basis on which the cluster merging has been performed is output. 

In the present invention, as described above, after categorizing documents 
into a plurality of clusters, the cluster merging process is performed such that 
relations among clusters of the plurality of clusters are evaluated on the basis of
documents included in the respective clusters, and two or more clusters having a
degree of relation equal to or higher than the predetermined value are combined
together. Even when a large number of clusters have been generated in a first-time
clustering process, the degrees of relations among the generated clusters are
evaluated and clusters having high degrees of relations are combined together, and
a simplified categorization result is presented to a user. This allows the user to find
desired information in a highly efficient manner.

Because the evaluation of relations among clusters under consideration as to 
whether they should be merged or not is performed on the basis of the number of 
documents commonly included in the clusters under consideration relative to the 
total number of documents included in the clusters under consideration, the cluster
merging process can be performed easily and correctly. 
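
The document-overlap evaluation described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the embodiment's implementation: it reads "the total number of documents included in the clusters under consideration" as the number of distinct documents across both clusters, and the function names, the document IDs, and the threshold of 0.5 are hypothetical.

```python
def degree_of_relation(cluster_a, cluster_b):
    """Ratio of documents common to both clusters to all distinct documents in them.

    Assumption: "total number of documents" is taken as the size of the union.
    """
    a, b = set(cluster_a), set(cluster_b)
    total = len(a | b)
    return len(a & b) / total if total else 0.0


def should_merge(cluster_a, cluster_b, threshold=0.5):
    """Merge when the degree of relation is equal to or higher than the threshold.

    The threshold value 0.5 is a hypothetical stand-in for the predetermined value.
    """
    return degree_of_relation(cluster_a, cluster_b) >= threshold


# Hypothetical document IDs: two of the four distinct documents are shared.
ratio = degree_of_relation({"D1", "D2", "D3"}, {"D2", "D3", "D4"})
```

Under the union reading, two clusters that share every document have a degree of relation of 1.0, and clusters with no common document have a degree of 0.0.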

The evaluation of relations among clusters may be performed such that the
manner in which feature elements appear in the respective clusters under
consideration as to whether they should be merged or not is examined, and cluster
merging may be performed in accordance with the manner in which the feature
elements appear. In this method, because the evaluation of the degree of relations
among clusters is performed on the basis of the actual contents of documents, the
cluster merging process can be performed in a more proper fashion.

The cluster merging process is performed for a combination of at least two 
clusters, and after completion of the cluster merging process, the cluster merging 
process is performed repeatedly for the set of clusters obtained in the previous
cluster merging process until no further cluster merging occurs, thereby making it 
possible to obtain a simplified categorization result. 
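
The repetition of the merging process until no further merging occurs might look like the following sketch. It assumes a degree of relation computed as shared documents over distinct documents, a hypothetical threshold of 0.5, and a hypothetical naming scheme that joins the names of merged clusters with "+"; none of these specifics come from the text.

```python
from itertools import combinations


def relation(docs_a, docs_b):
    """Assumed measure: shared documents divided by distinct documents."""
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0


def merge_until_stable(clusters, threshold=0.5):
    """Repeat pairwise merging until a full pass finds no pair to merge.

    clusters: dict mapping a cluster name to a set of document IDs.
    Returns a new dict; merged clusters get the hypothetical name "a+b".
    """
    clusters = {name: set(docs) for name, docs in clusters.items()}
    merged = True
    while merged:
        merged = False
        # Iterate over a snapshot so the dict can be modified before restarting.
        for (n1, d1), (n2, d2) in combinations(list(clusters.items()), 2):
            if relation(d1, d2) >= threshold:
                del clusters[n1], clusters[n2]
                clusters[f"{n1}+{n2}"] = d1 | d2
                merged = True
                break  # restart the pass on the resultant set of clusters
    return clusters


result = merge_until_stable({"a": {1, 2, 3}, "b": {2, 3, 4}, "c": {7, 8}})
```

Restarting after each merge keeps the sketch simple; a merged cluster is then compared against every remaining cluster in the next pass.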

After completion of the cluster merging process, supplementary information 
indicating that cluster merging has been performed and also indicating the basis on 
which the cluster merging has been performed is output. Thus, a user can know in
what manner the cluster merging process has been performed. This makes it 
possible for the user to use the supplementary information to find desired 
information from the result of the cluster merging process. 

According to a second aspect of the present invention, there is provided a 
document categorizing method for categorizing a plurality of documents into a
plurality of clusters according to semantic similarity, the method being 
characterized in that after categorizing the plurality of documents into a plurality of 
clusters according to semantic similarity, a cluster merging process is performed 
such that relations among clusters of the plurality of clusters are evaluated on the 
basis of documents included in the respective clusters, and two or more clusters
having a degree of relation equal to or higher than a predetermined value are
combined together; information representing which clusters have been merged
together and also representing the degrees of relation among the merged clusters is
generated and the information is output together with the categorization result to
be presented to a user so that when final clusters obtained as a result of the cluster
merging process are displayed, the user can see in what manner the cluster merging 
process has been performed to obtain the final cluster. 

Preferably, the information output so as to enable the user to see in what 
manner the cluster merging process has been performed is given by modifying the
manner of displaying the cluster names of respective clusters merged together in
accordance with the degree of relation among the clusters merged together in such a
manner that when the degree of relation among the clusters is higher than a
predetermined value, the cluster names are displayed in an AND form, and
when the degree of relation among the clusters is lower than the predetermined
value, the cluster names are displayed in an OR form.

Preferably, when the cluster names are displayed in the AND form, the
cluster names of the respective clusters are displayed successively in a single 
horizontal line or the respective cluster names are displayed in different lines, while 
when the cluster names are displayed in the OR form, a delimiter is inserted 
between adjacent cluster names of the respective clusters.

When a certain cluster includes another cluster therein, the name of the
included cluster may be enclosed within brackets and placed after the name of the
certain cluster.

In the present invention as described above, cluster-merging-process 
information is generated which represents which clusters have been merged
together and also represents the degrees of relation among the merged clusters, and 
the cluster-merging-process information is displayed when final clusters obtained 
via the cluster merging process are displayed so that a user can see in what manner 
the cluster merging process has been performed to obtain the final cluster. 

This makes it possible for the user to easily understand which clusters have
been merged into which final clusters and to know the degrees of relations among
the clusters merged together, simply by seeing the information displayed. The
information output so as to enable the user to see what relations the clusters have
is given by modifying the manner of displaying the cluster names of respective
clusters merged together in accordance with the degrees of relations among the
clusters merged together.

More specifically, when the degree of relation among clusters is higher than a
predetermined value, the cluster names are displayed in an AND form, whereas
when the degree of relation among the clusters is lower than the predetermined
value, the cluster names are displayed in an OR form. For example, when the
degree of relation is very high, the cluster names may be displayed successively in a
single horizontal line or may be displayed in different lines such that one name is
placed in one line. In the case where the degree of relation is not very high, a
delimiter may be inserted between adjacent cluster names. When the user sees the
cluster names displayed in one of the above manners, he/she can understand from
which clusters the cluster has been created via the cluster merging process and can
know the degree of the relation among the original clusters.

In the case where a certain cluster includes another cluster therein, the 
cluster name of the cluster included in the first cluster may be enclosed within 
brackets after the cluster name of the first cluster. This allows the inclusive
relation to be represented in a simple manner. 
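
The display conventions described above can be illustrated with a short sketch. The text does not specify the delimiter, the bracket style, the threshold value, or how a degree exactly equal to the predetermined value is displayed; the choices below (" / ", square brackets, 0.8, and treating equality as the OR form) are assumptions made only for illustration, as are the function names.

```python
def merged_cluster_label(names, degree, threshold=0.8,
                         delimiter=" / ", multiline=False):
    """Build the displayed name of a merged cluster.

    AND form (degree higher than the threshold): names placed successively in
    a single line, or one name per line when multiline is True.
    OR form (otherwise): a delimiter is inserted between adjacent names.
    """
    if degree > threshold:
        return ("\n" if multiline else " ").join(names)
    return delimiter.join(names)


def inclusive_label(outer_name, inner_name):
    """Place the included cluster's name in brackets after the including one."""
    return f"{outer_name} [{inner_name}]"


and_form = merged_cluster_label(["printer", "scanner"], degree=0.9)
or_form = merged_cluster_label(["printer", "scanner"], degree=0.6)
```

With these hypothetical settings, a user reading "printer scanner" would infer a strongly related pair, while "printer / scanner" would signal a weaker relation.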

The present invention also provides a document categorizing apparatus for
categorizing a plurality of documents into a plurality of clusters according to 
semantic similarity, the apparatus comprising a clustering unit for categorizing a 
plurality of documents into a plurality of clusters in accordance with semantic 
similarity, and a cluster merging unit which evaluates the relation among the
plurality of clusters created by the clustering unit on the basis of the documents 
included in the respective clusters and then combines two or more clusters having a 
degree of relation equal to or higher than a predetermined value. 

Furthermore, the present invention also provides another document 
categorizing apparatus for categorizing a plurality of documents into a plurality of
clusters according to semantic similarity, the apparatus comprising a clustering
unit for categorizing a plurality of documents into a plurality of clusters in
accordance with semantic similarity, a cluster merging unit which evaluates the
relation among the plurality of clusters created by the clustering unit on the basis of
the documents included in the respective clusters and then combines two or more
clusters having a degree of relation equal to or higher than a predetermined value,
a cluster-merging-process information generator for generating
cluster-merging-process information representing which clusters have been merged
together and also representing the degrees of relation among the merged clusters
wherein the cluster-merging-process information is to be displayed when final
clusters obtained via the cluster merging process performed by the cluster merging
unit are displayed so that a user can see in what manner the cluster merging process
has been performed to obtain the final cluster, and categorization result outputting
means for outputting the cluster-merging-process information such that the
cluster-merging-process information is included in the categorization result to be
presented to the
user. 

The present invention also provides a storage medium on which a document 
categorizing program for categorizing a plurality of documents into a plurality of
clusters according to semantic similarity is stored, the document categorizing
program comprising a clustering step for categorizing a plurality of documents into
a plurality of clusters in accordance with semantic similarity, and a cluster merging
step in which the degrees of relation among clusters of the plurality of clusters
obtained in the clustering step are evaluated on the basis of documents included in
the respective clusters, and two or more clusters having a degree of relation equal to
or higher than a predetermined value are combined together.

Furthermore, the present invention provides another storage medium on 
which a document categorizing program for categorizing a plurality of documents
into a plurality of clusters according to semantic similarity is stored, the document
categorizing program comprising a clustering step for categorizing a plurality of
documents into a plurality of clusters in accordance with semantic similarity, a
cluster merging step in which the degrees of relation among clusters of the plurality
of clusters obtained in the clustering step are evaluated on the basis of documents
included in the respective clusters, and two or more clusters having a degree of
relation equal to or higher than a predetermined value are combined together, a
cluster-merging-process information generating step for generating
cluster-merging-process information representing which clusters have been merged
together and also representing the degrees of relation among the merged clusters
wherein the cluster-merging-process information is to be displayed when final
clusters obtained via the cluster merging process performed by the cluster merging
unit are displayed so that a user can see in what manner the cluster merging
process has been performed to obtain the final cluster, and a step for outputting the
cluster-merging-process information such that the cluster-merging-process
information is included in the categorization result to be presented to the user.

Best Mode for Carrying Out the Invention 

(First Embodiment)

A first embodiment of the present invention is described below. Note that the
technique described herein can be applied not only to a document categorizing
method and a document categorization apparatus according to the present
invention but also to a document categorization program stored on a storage
medium according to the present invention.

In the present embodiment, titles are first extracted from respective 
documents, then feature elements are extracted from the titles, and finally the
documents are categorized according to the extracted feature elements.

Fig. 1 illustrates the structure of the apparatus according to the present 
embodiment. As shown in Fig. 1, the apparatus consists mainly of a clustering unit
1 for categorizing a plurality of documents into a plurality of clusters in accordance
with semantic similarity, a cluster merging unit 2 which evaluates the relation
among the plurality of clusters created by the clustering unit 1 on the basis of the
documents included in the respective clusters and then combines two or more
clusters having a degree of relation equal to or higher than a predetermined value,
and a categorization result outputting unit 3 for outputting the categorization result
obtained via the cluster merging process performed by the cluster merging unit 2.

The clustering unit 1 includes a document storage unit 11, a sentence
analyzer 12, a feature element extractor 13, a feature table generator 14, a 
document categorizing unit 15, and a categorization result storage unit 16. 

The cluster merging unit 2 serves to combine clusters, as will be described in 
detail later.

The categorization result outputting unit 3 includes an output control unit 31 
and a display unit 32. The categorization result outputting unit 3 controls the 
process of outputting the result of the cluster merging process performed by the 
cluster merging unit 2. 

The document storage unit 11 included in the clustering unit 1 stores a large
amount of document data in a database. Hereinafter, it is assumed that documents
shown in Fig. 2 are categorized. The documents shown in Fig. 2 include different
documents D1, D2, ..., D7 including titles T1, T2, ..., T7 and main bodies A1, A2, ...,
A7, respectively.

The sentence analyzer 12 analyzes the documents stored in the document
storage unit 11 to extract the titles of the respective documents. The extraction of
the titles is performed by the sentence analyzer 12 as follows.

A first method is to detect a part defined as a title according to the document
format, and employ the detected part as a title if such a part is detected. A second
method is to detect a part specified to be displayed with characters having a greater
size than a standard size, and employ the detected part as a title if such a part is
detected. A third method is to extract a predetermined number of sentences or
words described at the beginning of a document and employ the extracted sentences
or words as a title. The first, second, and third methods described above are
performed sequentially. When the first method is performed, if a part defined as a
title is detected, the detected part is employed as the title. If a part defined as a
title is not detected, the second method is performed. If a part specified to be
displayed with large-sized characters is detected, the detected part is employed as
the title. If a part specified to be displayed with large-sized characters is not
detected, the third method is performed to detect a title.
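
The sequential fallback among the three title-detection methods can be sketched as follows. The document representation is hypothetical: a dict with an optional "title" field standing in for a format-defined title, and a "spans" list of (text, character size) pairs. The standard size of 12 and the limit of 10 leading words are likewise assumed values, not taken from the text.

```python
def extract_title(doc, standard_size=12, lead_words=10):
    """Return a title using the first method that succeeds.

    doc: hypothetical structure, e.g.
         {"title": "...", "spans": [("text", character_size), ...]}
    """
    # First method: a part defined as a title by the document format.
    title = doc.get("title")
    if title:
        return title

    spans = doc.get("spans", [])

    # Second method: parts displayed with characters larger than the standard size.
    large = [text for text, size in spans if size > standard_size]
    if large:
        return " ".join(large)

    # Third method: a predetermined number of words from the beginning.
    body = " ".join(text for text, _ in spans)
    return " ".join(body.split()[:lead_words])


title = extract_title({"spans": [("Big Heading", 18), ("body text here", 12)]})
```

Each later method runs only when the earlier one finds nothing, which mirrors the order of preference described above.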

The feature element extractor 13 extracts feature elements from the 
respective document titles detected by the sentence analyzer 12.

The feature table generator 14 produces a feature table representing
the relationship between the feature elements detected from the titles and the
respective documents. A specific example of the content of the feature table will be
described later.

The document categorizing unit 15 examines the content of the feature table
and categorizes the documents D1, D2, ..., D7 into a plurality of clusters according to
semantic similarity. More specifically, documents including a common feature
element are detected on the basis of the feature elements included in the titles of
the documents D1, D2, ..., D7, and the detected documents are categorized into a
cluster. The document categorizing unit 15 includes a synonymous feature
dictionary (not shown). Categorization into clusters may also be performed using
the synonymous feature dictionary as follows. When documents including a
common feature element are detected, a judgment as to whether a common feature
element is included or not is made by determining whether a synonym is included
or not using the synonymous feature dictionary. If synonyms are detected in
documents, those documents are categorized into the same cluster.

The categorization result storage unit 16 stores the content obtained via the 
categorization performed by the document categorization unit 15. 

The document categorization process performed by the apparatus constructed
in the above-described manner according to the present invention is described
below. In the document categorization process according to the present
embodiment, as shown in the flow chart of Fig. 3, a large number of documents to be
processed are first categorized into a plurality of clusters according to semantic
similarity (step S1). Thereafter, relations among the clusters are evaluated on the
basis of the documents included in the respective clusters, as will be described in
detail later (step S2). Two or more clusters having a degree of relation higher than
a predetermined level are combined together (step S3). The document
categorization process is described in further detail below with reference to a
specific example.

Herein, it is assumed that the documents D1, D2, ..., D7 shown in Fig. 2 are
categorized. In the present embodiment, feature elements are extracted from the
titles of the respective documents, and clustering is performed on the basis of the
extracted feature elements. Thereafter, a cluster merging process is performed
upon the result obtained through the above clustering process. First, the process of
extracting feature elements from the titles and performing the clustering on the
basis of the extracted feature elements (by the clustering unit 1) is described. The
sentence analyzer 12 detects the titles of the respective documents D1, D2, ..., D7.
For example, the title T1 is detected from the document D1, the title T2 from the
document D2, the title T3 from the document D3, and so on. Thus, the titles T1,
T2, ..., T7 are detected from the respective documents D1, D2, ..., D7.

The feature element extractor 13 then extracts feature elements from the
respective titles. Thereafter, the feature table generator 14 produces a feature table
representing the relationships between the feature elements and the documents
including the feature elements in their titles. An example of a feature table is
shown in Fig. 4. In this example, the feature table represents the relationship
between feature elements each included in three or more different documents and
the documents including the feature elements. Numerals described in the feature
table represent the numbers of feature elements included in the respective
documents. For example, in the case of the feature element "paper", one feature
element is included in the title of each of the documents D1, D4, D6, and D7.

As can be seen from the feature table shown in Fig. 4, documents including
"paper" as a feature element in their title are D1, D4, D6, and D7, documents
including "cassette" as a feature element in their title are D1, D4, and D7, and
documents including "installation" as a feature element in their title are D2, D3, D5,
and D7. In Fig. 2, these feature elements are underlined.

The document categorizing unit 15 accesses the feature table and categorizes the
respective documents into clusters for each feature element. The result of the
categorization is shown in Fig. 5. In the categorization into clusters, as described
earlier, the detection of feature elements commonly included in documents may be
performed by detecting synonyms included in documents, using the synonymous
feature dictionary, and documents including detected synonyms may be categorized
into the same document cluster. For example, when "paper" and "printing paper"
are extracted as feature elements, documents including either one of these feature
elements are categorized into the same cluster.
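
The construction of the feature table and the per-element clustering can be sketched
as follows. The per-title feature elements listed here are hypothetical (the titles
themselves appear only in Fig. 2); they are chosen solely so that the resulting table
agrees with the document lists given above for Fig. 4. The synonym mapping is
likewise an assumed stand-in for the synonymous feature dictionary.

```python
from collections import Counter

# Hypothetical feature elements per title, consistent with Fig. 4.
title_features = {
    "D1": ["paper", "cassette"],
    "D2": ["installation", "memory"],
    "D3": ["installation", "interface card"],
    "D4": ["paper", "cassette", "setting"],
    "D5": ["installation", "hard disk"],
    "D6": ["printing paper", "smear"],
    "D7": ["installation", "paper", "cassette"],
}

# Assumed stand-in for the synonymous feature dictionary.
SYNONYMS = {"printing paper": "paper"}

MIN_DOCS = 3  # keep feature elements found in three or more documents


def build_feature_table(title_features, min_docs=MIN_DOCS):
    """Map each feature element to {document: occurrence count}."""
    table = {}
    for doc_id, elements in title_features.items():
        canonical = [SYNONYMS.get(e, e) for e in elements]
        for element, count in Counter(canonical).items():
            table.setdefault(element, {})[doc_id] = count
    return {e: row for e, row in table.items() if len(row) >= min_docs}


feature_table = build_feature_table(title_features)
# One cluster per surviving feature element, as in Fig. 5.
clusters = {e: sorted(row) for e, row in feature_table.items()}
```

With these inputs, clusters holds the three clusters "paper", "cassette", and
"installation" with the document lists given above.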

The obtained categorization result is stored in the categorization result
storage unit 16. In the categorization result shown in Fig. 5, as for a cluster
categorized as "paper" (including documents D1, D4, D6, and D7), as can be
understood from the document contents shown in Fig. 2, a paper cassette is
described in the content of the document D1, setting of paper in the document D4, a
smear created on printed paper in the document D6, and installation of an
additional paper cassette in the document D7.

Thus, the categorization of the documents D1, D4, D6, and D7 including a
description about paper into the cluster can be regarded as proper.

In the case of a cluster categorized as "cassette" (including documents D1, D4,
and D7), as can be understood from the document contents shown in Fig. 2, a paper
cassette is described in the content of the document D1, setting of paper in the
document D4, and installation of an additional paper cassette in the document D7.

The contents of the documents D1, D4, and D7 include a description about
setting of paper, and thus the categorization of these documents into the cluster can
be regarded as proper.

In the case of a cluster categorized as "installation" (including documents D2,
D3, D5, and D7), as can be understood from the document contents shown in Fig. 2,
installation of an additional memory is described in the content of the document D2,
installation of an interface card in the document D3, installation of an additional
hard disk in the document D5, and installation of an additional paper cassette in
the document D7.

The contents of the documents D2, D3, D5, and D7 include a description about
installation of an additional component, and thus the categorization of these
documents into the cluster can be regarded as proper.

The reason why this technique allows documents to be properly categorized is
that feature elements are first extracted from the titles of the respective documents,
and then the documents are categorized on the basis of the extracted feature
elements. That is, in most cases, the titles of documents represent, in a simplified
fashion, what is described in the contents of the documents. Therefore, if
categorization is performed using feature elements included in the titles of
documents, scattering into a large number of clusters can be prevented, and the
probability of generation of noise clusters is reduced. Furthermore, because the
titles are created by the authors of the documents so as to concisely represent what
is described in the documents, categorization on the basis of the authors' viewpoints
can be obtained.

After completion of categorization, if a user issues a command to select the
cluster of "paper", documents D1, D4, D6, and D7 categorized in that cluster are
read from the document storage unit 11 and displayed on the display unit 32.
Herein, only the document numbers or document names may be displayed, or
otherwise the contents of the documents may be displayed.

In the present invention, after the clustering process described above, the 
cluster merging unit 2 performs a cluster merging process. 

That is, in the categorization result shown in Fig. 5, the cluster of "paper"
includes documents D1, D4, D6, and D7, and the cluster of "cassette" includes
documents D1, D4, and D7.

Thus, documents D1, D4, and D7 are included in both clusters of "paper" and
"cassette". This means that the feature element "paper" and the feature element
"cassette" have a close relation with each other. For example, the expression
"paper cassette" is included in the title or the main body of the documents D1, D4,
and D7, and thus these documents D1, D4, and D7 can be regarded as having a
close relation. Therefore, it is more desirable that these documents D1, D4, and D7
be categorized into the same cluster.

In the present invention, to the above end, after performing the clustering on
the basis of the feature elements, the cluster merging process is performed upon the
result of the clustering.

The cluster merging process is described below. First, aside from the 
categorization result shown in Fig. 5, a general example is described with reference 
to Fig. 6. 

We assume here that there are two clusters C1 and C2, wherein the cluster
C1 includes five documents D1, D2, D3, D4, and D8, and the cluster C2 includes six
documents D3, D4, D5, D6, D7, and D8.

Documents which are commonly included in both clusters C1 and C2 are D3,
D4, and D8. In the present embodiment, the degree of relation among a plurality of
clusters is evaluated on the basis of the number of documents which are commonly
included in the plurality of clusters, and clusters are merged depending upon the
evaluated degree of relation.

More specifically, the ratio of the number of documents which are commonly
included in two certain clusters to the total number of documents included in those
two clusters is calculated, and a decision as to whether these two clusters should be
merged is made depending upon whether the calculated ratio is equal to or greater
than a predetermined threshold value.

In this specific example, the total number of documents included in the two
clusters C1 and C2 is eleven, and three documents are commonly included in both
clusters. Thus, the ratio (%) of the number of common documents to the total
number of documents can be calculated, and the decision as to whether merging
should be performed is made in accordance with the calculation result. In the
calculation of the ratio (%), the ratio may be determined simply by dividing the
number of common documents by the total number of documents and further
multiplying the result by 100, or the ratio may be determined by dividing the
product of the number of common documents and a predetermined arbitrary factor
by the total number of documents and then multiplying the result by 100.

As an example, let us assume that the number of documents included in the
cluster C1 is equal to a1, the number of documents included in the cluster C2 is
equal to a2, and the number of documents which are commonly included in both
clusters C1 and C2 is equal to p. After multiplying p by a factor of, for example, 2,
the value 2p/(a1 + a2) x 100 is calculated. The result (%) is compared with a
predetermined threshold value TH (%). If the calculated result is equal to or greater
than the threshold value TH, then merging is performed. In the example shown in
Fig. 6, 2p = 2 x 3 = 6, and a1 + a2 = 5 + 6 = 11, and thus the ratio is calculated as
about 55%. If the threshold value TH is set to 70%, the calculated ratio (55%) is
smaller than the threshold value TH (70%), and thus it is determined that the
clusters C1 and C2 should not be merged. The above factor may be set to an
arbitrary value such that the calculated ratio (%) falls within a range which is
proper for comparison with the threshold value. Thus, although the factor is set to 2
in the above example, the factor may be set to 1.
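
The calculation described above can be sketched as follows. The function names are
introduced here for illustration only; the factor of 2 and the threshold of 70% are
the example values used in the text, not fixed values.

```python
def relation_ratio(cluster_a, cluster_b, factor=2):
    """Return factor * p / (a1 + a2) * 100 for two clusters of document IDs."""
    a, b = set(cluster_a), set(cluster_b)
    common = len(a & b)        # p: documents included in both clusters
    total = len(a) + len(b)    # a1 + a2
    if total == 0:
        return 0.0             # two empty clusters have no relation
    return factor * common / total * 100


def should_merge(cluster_a, cluster_b, threshold=70.0, factor=2):
    """Merge when the ratio is equal to or greater than the threshold."""
    return relation_ratio(cluster_a, cluster_b, factor) >= threshold


# The general example of Fig. 6.
C1 = ["D1", "D2", "D3", "D4", "D8"]
C2 = ["D3", "D4", "D5", "D6", "D7", "D8"]
fig6_ratio = relation_ratio(C1, C2)    # 6 / 11 * 100, about 55
fig6_merge = should_merge(C1, C2)      # False with TH = 70
```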

Referring back to the categorization result shown in Fig. 5, the cluster of
"paper" includes four documents D1, D4, D6, and D7, and the cluster of "cassette"
includes three documents D1, D4, and D7. Three documents D1, D4, and D7 are
commonly included in both clusters. Now, we calculate the ratio (%) of the number
of common documents to the total number of documents.

The calculation is performed in accordance with the formula described above.
In the case of the categorization result shown in Fig. 5, the total number of
documents (a1 + a2) is calculated as 4 + 3 = 7, the number of common documents is
equal to 3, and thus 2p is calculated as 6. In this case, the ratio becomes as high as
about 86%. Because the calculated ratio is greater than the predetermined
threshold value (70% in this example), it is determined that the cluster of "paper"
and the cluster of "cassette" should be merged into a single cluster.

Similarly, decisions as to whether the cluster of "paper" and the cluster of 
"installation" should be merged and whether the cluster of "cassette" and the cluster 
of "installation" should be merged are made as follows. 

As for the clusters of "paper" and "installation", the cluster of "paper"
includes four documents D1, D4, D6, and D7, and the cluster of "installation"
includes four documents D2, D3, D5, and D7. Only one document D7 is commonly
included in both clusters. Thus, according to the formula described above, the ratio
is calculated as 25%, which is lower than the threshold value (70%). Therefore, it is
determined that these clusters should not be merged.

As for the clusters of "cassette" and "installation", the cluster of "cassette"
includes three documents D1, D4, and D7, the cluster of "installation" includes four
documents D2, D3, D5, and D7, and only one document D7 is commonly included in
both clusters. Thus, according to the formula described above, the ratio is
calculated as about 29%, which is also lower than the threshold value (70%).
Therefore, it is determined that these clusters should not be merged.

As described above, it is determined whether merging should be performed
for each combination of two clusters. The result of categorization (recategorization
by merging) performed upon the categorization result shown in Fig. 5 is shown in
Fig. 7. In Fig. 7, the cluster of "paper" and the cluster of "cassette" are combined
into a single cluster of "paper + cassette" including documents D1, D4, D6, and D7.
On the other hand, the cluster of "installation" remains in the original state without
being combined with another cluster.
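
The three pairwise decisions for the clusters of Fig. 5 can be reproduced with the
short sketch below. It assumes the example factor of 2 and threshold of 70%; the
variable names are illustrative.

```python
from itertools import combinations

# Clusters of Fig. 5.
clusters = {
    "paper": {"D1", "D4", "D6", "D7"},
    "cassette": {"D1", "D4", "D7"},
    "installation": {"D2", "D3", "D5", "D7"},
}
THRESHOLD = 70.0  # example threshold value TH (%)


def ratio(a, b, factor=2):
    """factor * (common documents) / (total documents) * 100."""
    return factor * len(a & b) / (len(a) + len(b)) * 100


# Ratio (%) for every combination of two clusters.
ratios = {
    (x, y): ratio(clusters[x], clusters[y])
    for x, y in combinations(clusters, 2)
}
to_merge = [pair for pair, r in ratios.items() if r >= THRESHOLD]
merged = clusters["paper"] | clusters["cassette"]
```

Only the pair of "paper" and "cassette" reaches the threshold, and the union of the
two clusters contains D1, D4, D6, and D7, as in Fig. 7.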

Referring to Fig. 7, the recategorization result obtained through the cluster
merging process indicates that in the cluster "paper + cassette" (including
documents D1, D4, D6, and D7), as can be understood from the document contents
shown in Fig. 2, a paper cassette is described in the content of the document D1,
setting of paper is described in the document D4, a method of handling which
should be performed when printed paper becomes dirty is described in the document
D6, and installation of an additional paper cassette is described in the document
D7.

The contents of the documents D1, D4, D6, and D7 include a description about
paper or a cassette, and thus the recategorization of these documents into the single
cluster can be regarded as proper. As a matter of fact, merging into the single
cluster of "paper + cassette" results in better categorization.

As described above, a better result can be obtained by first extracting feature
elements from the titles of the respective documents, then performing the clustering
process on the basis of the extracted feature elements, and finally performing the
cluster merging process for each combination of two clusters of the clusters obtained
via the above clustering process.

After the first cluster merging process for each combination of two clusters,
the result of the recategorization by the cluster merging is obtained as shown in
Fig. 7. Thereafter, a second cluster merging process is performed upon the
recategorization result obtained through the first cluster merging process. That is,
in the result of the first cluster merging process shown in Fig. 7, a cluster merging
process is performed for a combination of the cluster of "paper + cassette" and the
cluster of "installation". In this example, as for the combination of clusters of
"paper + cassette" and "installation", the cluster of "paper + cassette" includes four
documents D1, D4, D6, and D7, the cluster of "installation" includes four documents
D2, D3, D5, and D7, and only one document D7 is included in both clusters. The
ratio (%) of the number of common documents to the total number of documents is
calculated as follows. The number of common documents, equal to 1, is first
multiplied by a factor of 2 and then divided by the total number of documents, equal
to 8, and further multiplied by 100, and thus the result is obtained as 25%, which is
lower than the threshold value (70%). Thus, it is determined that these clusters
should not be merged.

As described above, after completion of the first cluster merging process for
each combination of two clusters, the second cluster merging process is performed
for each combination of two clusters of the clusters obtained via the first cluster
merging process. After completion of the second cluster merging process, a third
cluster merging process is performed for each combination of two clusters of the
clusters obtained via the second cluster merging process. The above process is
performed repeatedly until no new cluster is created (until no clusters are merged).
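
The repetition until no further merging occurs can be sketched as a simple loop.
As a simplification assumed for this sketch, one qualifying pair is merged at a time
and the pairwise evaluation is then restarted on the resulting set of clusters; the
naming of a merged cluster as "x + y" follows Fig. 7, and the factor and threshold
are again the example values.

```python
from itertools import combinations


def ratio(a, b, factor=2):
    """factor * (common documents) / (total documents) * 100."""
    return factor * len(a & b) / (len(a) + len(b)) * 100


def merge_until_stable(clusters, threshold=70.0, factor=2):
    """Repeat pairwise merging until no pair reaches the threshold."""
    clusters = {name: set(docs) for name, docs in clusters.items()}
    merged_any = True
    while merged_any:
        merged_any = False
        for x, y in combinations(list(clusters), 2):
            if ratio(clusters[x], clusters[y], factor) >= threshold:
                # Replace the two clusters with their union, then start over.
                clusters[f"{x} + {y}"] = clusters.pop(x) | clusters.pop(y)
                merged_any = True
                break
    return clusters


fig5 = {
    "paper": {"D1", "D4", "D6", "D7"},
    "cassette": {"D1", "D4", "D7"},
    "installation": {"D2", "D3", "D5", "D7"},
}
result = merge_until_stable(fig5)
```

For the clusters of Fig. 5 the loop stops after the second pass, leaving the clusters
"paper + cassette" and "installation".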

Although in the above example, the cluster merging process is performed for
a combination of two clusters, the cluster merging process may be performed for a
combination of three or more clusters. In this case, in a first cluster merging
process, cluster merging is performed for each combination of three or more
clusters. Thereafter, cluster merging may be performed repeatedly for the result
obtained via the previous cluster merging process until no further merging occurs.
As in the previous case, the judgment as to whether three or more clusters should
be merged can be performed on the basis of the ratio (%) of the number of common
documents to the total number of documents included in these clusters.
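
One possible reading of the ratio for three or more clusters is sketched below. The
text does not fix the details for this case, so two assumptions are made here:
"common documents" means documents included in all of the clusters under
consideration, and the factor defaults to the number of clusters, which reduces to
the factor of 2 used above when two clusters are given.

```python
def group_ratio(clusters, factor=None):
    """Ratio (%) of documents common to all clusters to the total count."""
    sets = [set(c) for c in clusters]
    common = set.intersection(*sets)       # documents in every cluster
    total = sum(len(s) for s in sets)      # sum of the cluster sizes
    if factor is None:
        factor = len(sets)                 # assumed default factor
    return factor * len(common) / total * 100 if total else 0.0


paper = {"D1", "D4", "D6", "D7"}
cassette = {"D1", "D4", "D7"}
installation = {"D2", "D3", "D5", "D7"}

three_way = group_ratio([paper, cassette, installation])  # 3 * 1 / 11 * 100
two_way = group_ratio([paper, cassette])                  # 2 * 3 / 7 * 100
```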

In the above-described cluster merging process for combinations of a plurality
of clusters, the ratio of the number of documents commonly included in the clusters
to the total number of documents is calculated from the categorization result such
as that shown in Fig. 5, and the ratio is compared with the predetermined threshold
value. Alternatively, the judgment as to whether clusters should be merged or not
can be made by examining in what manner feature elements characterizing the
respective clusters are used in the original documents. An example of an apparatus
for performing the cluster merging process in the above-described manner is shown
in Fig. 8. The apparatus shown in Fig. 8 includes the same constituent parts as
those shown in Fig. 1, and they are denoted by the same reference numerals.
The difference, however, is that the output of the document storage unit 11 is
applied to the cluster merging unit 2 so that the decision as to whether cluster
merging should be performed can be made on the basis of the document contents, as
will be described below.

Herein, let us assume that the cluster merging process is performed for the
clusters of "paper" and "cassette" shown in Fig. 5. The cluster of "paper" includes
documents D1, D4, D6, and D7, and the cluster of "cassette" includes documents D1,
D4, and D7.

These documents are examined to detect in what manner the words "paper"
and "cassette" are used in the documents. In the document D1, a phrase "paper
cassette" which is a combination of "paper" and "cassette" appears at plural
locations. The document D4 also includes a phrase "paper cassette". Furthermore,
in the document D4, "paper" and "cassette" appear at close locations. The document
D7 also includes a phrase "paper cassette" and a phrase "paper cassette unit".
Although the document D6 does not include the word "cassette", the word "paper"
appears at plural locations.

From the above, it can be concluded that the words "paper" and "cassette"
extracted as feature elements are used at adjacent or close locations, and thus they
can be regarded as having a close relation. Thus, at least documents D1, D4, and
D7 have a close relation, and the document D6 has a relation to some extent.
Therefore, it can be concluded that the clusters of "paper" and "cassette" can be
combined properly into a single cluster of "paper + cassette".

Thereafter, the cluster merging process is performed for the clusters of
"paper" and "installation". The cluster of "paper" includes documents D1, D4, D6,
and D7, and the cluster of "installation" includes documents D2, D3, D5, and D7.

These documents are examined to detect in what manner the words "paper"
and "installation" are used in the documents. In the documents D1, D2, D3, D4, D5,
and D6, "paper" and "installation" do not appear at adjacent or close locations. Only
in the document D7 do "paper cassette" and "installation" appear at close locations.

Therefore, it can be concluded that "paper" and "installation" extracted as
feature elements are not frequently used at adjacent or close locations, and thus
they can be regarded as having little relation. Thus, it is determined that the
clusters of "paper" and "installation" should not be merged.

In the case of the cluster merging process for a combination of the clusters of
"cassette" and "installation", as in the case of the combination of the clusters of
"paper" and "installation", "cassette" and "installation" are not used at adjacent or
close locations.

Therefore, it can be concluded that "cassette" and "installation" extracted as
feature elements are not frequently used at adjacent or close locations, and thus
they can be regarded as having little relation. Thus, it is determined that the
clusters of "cassette" and "installation" should not be merged.
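
The examination of how two feature elements are used in the documents can be
sketched as a proximity check. The text does not quantify "adjacent or close", so the
window of five words, the share-of-documents measure, and the sample document
bodies below are all assumptions made for illustration; the actual document contents
appear only in Fig. 2.

```python
import re


def near(text, word_a, word_b, window=5):
    """True if word_a and word_b occur within `window` words of each other."""
    words = re.findall(r"[a-z]+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == word_a]
    pos_b = [i for i, w in enumerate(words) if w == word_b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)


def cooccurrence_share(docs, word_a, word_b, window=5):
    """Share of documents in which the two words appear at close locations."""
    hits = sum(near(text, word_a, word_b, window) for text in docs.values())
    return hits / len(docs) if docs else 0.0


# Hypothetical main bodies, loosely following the descriptions in the text.
bodies = {
    "D1": "Pull out the paper cassette and check the paper cassette guide.",
    "D2": "Installation of additional memory.",
    "D3": "Installation of an interface card.",
    "D4": "Set paper in the paper cassette.",
    "D5": "Installation of an additional hard disk.",
    "D6": "If the paper is smeared, clean the roller.",
    "D7": "Installation of the additional paper cassette unit.",
}
paper_or_cassette = {k: bodies[k] for k in ("D1", "D4", "D6", "D7")}

share_paper_cassette = cooccurrence_share(paper_or_cassette, "paper", "cassette")
share_paper_installation = cooccurrence_share(bodies, "paper", "installation")
```

A high share, as for "paper" and "cassette", would support merging, while a low
share, as for "paper" and "installation", would not.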

Also in the case where cluster merging is performed in the above-described
manner depending upon in what manner feature elements characterizing the
respective clusters are used in the original documents, after completion of a first
cluster merging process for each combination of clusters, a second cluster merging
process is performed for each combination of two clusters of the clusters obtained
via the first cluster merging process. After completion of the second cluster merging
process, a third cluster merging process is performed for each combination of two
clusters of the clusters obtained via the second cluster merging process. The above
process is performed repeatedly until no further cluster is created (until no clusters
are merged).

Although in the above example, the cluster merging process is performed for
a combination of two clusters, the cluster merging process may be performed for a
combination of three or more clusters. In this case, in a first cluster merging
process, cluster merging is performed for each combination of three or more
clusters. Thereafter, cluster merging may be performed repeatedly for the result
obtained via the previous cluster merging process until no further merging occurs.

It is desirable that when the result obtained via the cluster merging process
is presented to a user, information representing how the cluster merging process
has been performed also be presented to the user together with the above result.
This can be achieved if the information representing in what manner the cluster
merging process has been performed by the cluster merging unit 2 is supplied to the
output control unit 31, and the output control unit 31 displays the received
information on the display unit 32.

Note that the present embodiment is not limited to the specific examples
described above, but various modifications are possible without departing from the
spirit of the embodiment. For example, although in the above example, feature
elements to be used to obtain a categorization result such as that shown in Fig. 5
are extracted from the titles of the respective documents, and clustering is
performed on the basis of the feature elements extracted from the titles, what is
essential to the present embodiment is that after categorizing documents into
clusters according to semantic similarity, similar clusters are merged. Therefore,
the manner of clustering a plurality of documents is not limited to a particular
method. For example, instead of clustering documents on the basis of feature
elements extracted from the titles of the documents according to the above
embodiment, clustering may also be performed according to URL addresses (after
removing "http://", the remaining part is employed), updated date/time (without any
restriction or within the last one month), or file sizes (the sizes of the Web pages in
bytes). One of these items or some combination of these items may be employed in
the clustering process. A desired item can be selected, for example, from a menu.
In the case where a selected item is not included in a document, another item may
be employed instead of the selected item. For example, when the title is selected as
the item, if a Web page does not include a title, a URL address may be employed.
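
The selection of a clustering item with a fallback can be sketched as follows. The
page representation (a dictionary with optional "title" and "url" entries) and the
function names are assumptions made for this sketch.

```python
def normalize(item, value):
    """For URL addresses, employ the part remaining after "http://"."""
    if item == "url" and value.startswith("http://"):
        return value[len("http://"):]
    return value


def clustering_key(page, selected="title", fallback="url"):
    """Return (item, value) for the selected item, or for the fallback item
    when the page does not include the selected one."""
    for item in (selected, fallback):
        value = page.get(item)
        if value:
            return item, normalize(item, value)
    return None  # neither item is available


with_title = clustering_key({"title": "Paper cassette", "url": "http://example.com/a"})
without_title = clustering_key({"url": "http://example.com/a"})
```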

After performing the clustering using one of the methods, the judgment as to
whether clusters should be merged or not is made by evaluating the similarity
between the clusters under consideration in the manner described above.

For example, let us consider an example in which clustering is performed
according to URL addresses. We assume herein that documents have been
categorized into a cluster of a certain URL (URL1) and a cluster of another URL
(URL2). We further assume that the cluster of URL1 includes documents D1, D2,
D3, and D4, and the cluster of URL2 includes documents D2, D3, D4, and D5. In this
case, documents which are commonly included in both clusters are D2, D3, and D4.
The ratio of the number of common documents to the total number of documents is
calculated, and it is determined according to the calculated ratio whether the
cluster of URL1 and the cluster of URL2 should be merged or not.

Although in the above embodiment, the judgment as to whether clusters
should be merged or not is made by comparing the ratio (%) of the number of
documents commonly included in clusters under consideration to the total number
of documents with a predetermined threshold value (%), the manner of the judgment
is not limited to that. For example, the judgment as to whether clusters should be
merged or not may be made in accordance with the number of common documents
relative to the numbers of documents included in the respective clusters.
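
One way to judge by the number of common documents relative to the size of each
cluster is sketched below, applied to the URL example above. The rule used here
(merge when the common documents make up at least 70% of each of the two
clusters) and the function names are assumptions for illustration; the text does not
prescribe a particular rule.

```python
def per_cluster_shares(a, b):
    """Share of common documents within each cluster (clusters non-empty)."""
    a, b = set(a), set(b)
    common = len(a & b)
    return common / len(a), common / len(b)


def should_merge_by_shares(a, b, min_share=0.7):
    """Assumed rule: both clusters must consist largely of common documents."""
    return min(per_cluster_shares(a, b)) >= min_share


url1 = ["D1", "D2", "D3", "D4"]
url2 = ["D2", "D3", "D4", "D5"]
url_shares = per_cluster_shares(url1, url2)       # (0.75, 0.75)
url_merge = should_merge_by_shares(url1, url2)    # True under the assumed rule
```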

In the above embodiment, different documents D1, D2, ..., D7 are categorized.
The embodiment may also be applied to the case where a single document is divided
into a plurality of contents (into parts having their own themes) and the respective
contents are categorized. Herein, let us assume that contents are given by dividing
a single document at respective titles into plural parts each describing its own
particular theme.

For example, if it is assumed that the documents D1, D2, ..., D7 shown in Fig.
7 are parts of the same single document, these documents D1, D2, ..., D7 can be
regarded as contents in the above-described sense. In this case, the respective
contents include titles T1, T2, ..., T7 and main bodies A1, A2, ..., A7.

As described above, the present embodiment may be applied to the case where
a single document is divided into a plurality of contents, then the contents are
categorized into clusters, and finally similar clusters in the obtained clusters are
merged.

Furthermore, the present embodiment may also be applied to a plurality of
documents obtained via a general retrieval service. In this case, the clustering
process is first performed for a large number of documents obtained via the
retrieval, and the cluster merging process is then performed for the result of the
clustering process.

A program used to execute the above-described document categorizing
process according to the present embodiment may be stored on a storage medium
such as a floppy disk, an optical disk, or a hard disk. Note that such a storage
medium also falls within the scope of the present invention. The program may also
be obtained via a network.

(Second Embodiment) 

When merged clusters are presented to a user, if only the final result of the
cluster merging process is presented and no information about the cluster merging
process is presented, the user cannot know which clusters have been combined
together into final clusters and cannot know the degree of relation among the
original clusters combined together into the final clusters.

In the present embodiment, to solve the above problem, when final clusters,
which are obtained through the cluster merging process in which clusters having
close relations are combined together, are displayed, the clusters are displayed in a
manner that allows a user to see which clusters have been combined together into
which final clusters and also see the degrees of relation among the clusters
combined together.
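
The kind of information to be displayed together with a final cluster can be
sketched as a small record. The MergeRecord structure, its field names, and the
output wording are hypothetical and serve only to illustrate recording which clusters
were combined and their degree of relation; the value used is the ratio calculated for
"paper" and "cassette" in the first embodiment.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MergeRecord:
    """Hypothetical record of one merge, kept for display to the user."""
    merged_from: Tuple[str, ...]   # names of the original clusters
    result: str                    # name of the final cluster
    relation_percent: float        # degree of relation among the originals


def describe(record: MergeRecord) -> str:
    parts = " and ".join(f'"{name}"' for name in record.merged_from)
    return (f"{record.result}: merged from {parts} "
            f"(relation {record.relation_percent:.0f}%)")


record = MergeRecord(("paper", "cassette"), "paper + cassette", 2 * 3 / (4 + 3) * 100)
line = describe(record)
```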

The second embodiment of the present invention is described in further detail
below.

In this second embodiment, categorization of documents is performed, as
described above, by first extracting the titles of the respective documents, then
extracting feature elements from the titles, and finally categorizing the documents
on the basis of the extracted feature elements.

Fig. 9 illustrates the second embodiment. As shown in Fig. 9, an apparatus
of the second embodiment consists mainly of a clustering unit 91 for categorizing
documents into a plurality of clusters in accordance with semantic similarity, a
cluster merging unit 92 which evaluates the relation among the plurality of clusters
created by the clustering unit 91 on the basis of the documents included in the
respective clusters and then combines two or more clusters having a degree of
relation equal to or higher than a predetermined value, a cluster-merging-process
information generator 93 for generating cluster-merging-process information
representing which clusters have been combined together and also representing the
degrees of relation among the combined clusters, wherein the
cluster-merging-process information is to be displayed when final clusters obtained
via the cluster merging process performed by the cluster merging unit 92 are
displayed, and a categorization result outputting unit 94 for outputting the
categorization result including the cluster-merging-process information.

The clustering unit 91 includes a document storage unit 911, a sentence
analyzer 912, a feature element extractor 913, a feature table generator 914, a
document categorizing unit 915, and a categorization result storage unit 916.

The document storage unit 911 stores, in the form of a database, a large
amount of document data. Hereinafter, it is assumed that the documents shown in
Fig. 10 are categorized. The documents shown in Fig. 10 include different
documents D1, D2, ..., D7 including titles T1, T2, ..., T7 and main bodies A1,
A2, ..., A7, respectively.

The sentence analyzer 912 analyzes the documents stored in the document
storage unit 911 to extract the titles of the respective documents. The extraction of
the titles is performed by the sentence analyzer 912 as follows.

A first method is to detect a part defined as a title according to the document
format, and employ the detected part as a title if such a part is detected. A second
method is to detect a part specified to be displayed with characters having a greater
size than a standard size, and employ the detected part as a title if such a part is
detected. A third method is to extract a predetermined number of sentences or
words located at the beginning of a document and employ the extracted sentences or
words as a title. The first, second, and third methods described above are
performed sequentially. When the first method is performed, if a part defined as a
title is detected, the detected part is employed as the title. If a part defined as a
title is not detected, the second method is performed. If a part specified to be
displayed with large-sized characters is detected, the detected part is employed as
the title. If a part specified to be displayed with large-sized characters is not
detected, the third method is performed to detect a title.
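
By way of a non-limiting illustration, the sequential application of the three
title detection methods may be sketched as follows. The dictionary layout of a
document (the keys "title_field", "parts", and "body"), the function name, and the
default values are assumptions made only for this sketch and are not part of the
described apparatus.

```python
def extract_title(doc: dict, standard_size: int = 12, n_words: int = 5) -> str:
    """Try the three title detection methods in order; return the first hit."""
    # First method: a part defined as a title by the document format.
    title = doc.get("title_field")
    if title:
        return title
    # Second method: a part displayed with characters larger than the
    # standard size. "parts" is an assumed list of (text, font_size) pairs.
    for text, size in doc.get("parts", []):
        if size > standard_size:
            return text
    # Third method: a predetermined number of words taken from the
    # beginning of the document.
    return " ".join(doc.get("body", "").split()[:n_words])
```

In this sketch the later methods are reached only when the earlier ones detect
nothing, which mirrors the order of fallback described above.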

The feature element extractor 913 extracts feature elements from the
respective document titles detected by the sentence analyzer 912.

The feature table generator 914 produces a feature table representing the
relationship between the feature elements detected from the titles and the
respective documents. A specific content of the feature table will be
described later.

The document categorizing unit 915 examines the content of the feature table
and categorizes the documents D1, D2, ..., D7 into a plurality of clusters according to
semantic similarity. Documents including a common feature element are detected
on the basis of the feature elements included in the titles of the documents D1,
D2, ..., D7, and the detected documents are categorized into a cluster. The document
categorizing unit 915 includes a synonymous feature dictionary (not shown).
Categorization into clusters may also be performed using the synonymous feature
dictionary as follows. When documents including a common feature element are
detected, a judgment as to whether a common feature element is included or not is
made by determining whether a synonym is included or not using the synonymous
feature dictionary. If synonyms are detected in documents, those documents are
categorized into the same cluster.

The categorization result storage unit 916 stores the content obtained via the
categorization performed by the document categorizing unit 915.

The cluster merging unit 92 evaluates the relation among the plurality of
clusters on the basis of the documents included in the respective clusters and then
combines two or more clusters the degree of relation among which is equal to or
higher than a predetermined value, as will be described in detail later.

The cluster-merging-process information generator 93 includes a relation
evaluator 931 and a manner-of-displaying-cluster-name determiner 932, wherein
the relation evaluator 931 evaluates the degree of relation among clusters by
comparing a cluster correlation score (described later) generated by the cluster
merging unit 92 with a predetermined threshold value (described later), and the
manner-of-displaying-cluster-name determiner 932 determines the manner of
displaying cluster names so as to indicate which clusters have been combined
together and indicate the degree of relation among the combined clusters, on the
basis of the degree of relation evaluated by the relation evaluator 931. The
processing performed by the relation evaluator 931 and the
manner-of-displaying-cluster-name determiner 932 will be described in further
detail later.

The categorization result outputting unit 94 includes an output control unit 941
and a display unit 942 and serves to output the document categorization result
obtained according to the present invention.

The document categorization process performed by the apparatus constructed
in the above-described manner according to the present invention is described
below. The outline of the document categorization process according to the present
embodiment is as follows. As shown in the flow chart in Fig. 11, a large number of
documents to be processed are first categorized into a plurality of clusters according
to semantic similarity (step 11S1). Thereafter, the degrees of relation among
clusters are evaluated on the basis of the documents included in the respective
clusters (step 11S2). Two or more clusters having a degree of relation equal to or
higher than a predetermined value are combined together (step 11S3). Thereafter,
cluster-merging-process information is generated which indicates which clusters
have been merged into final clusters and also indicates the degrees of relation
among the original clusters combined together. More specifically, the degrees of
relation among the clusters which have been merged are determined (step 11S4), and
cluster-merging-process information is generated on the basis of the degrees of
relation so that the cluster-merging-process information represents the properties of
the original clusters combined together into the final clusters, that is, so that the
cluster-merging-process information indicates which clusters have been combined
together into final clusters and also indicates the degrees of relation among the
original clusters combined together (step 11S5). The document categorization
process is described in further detail below with reference to a specific example.

Herein, it is assumed that the documents D1, D2, ..., D7 shown in Fig. 10 are
categorized. In the present embodiment, feature elements are extracted from the
titles of the respective documents, and clustering is performed on the basis of the
extracted feature elements. Thereafter, obtained clusters are merged. First, the
process of extracting feature elements from the titles and performing the clustering
on the basis of the extracted feature elements (by the clustering unit 91) is described.

The sentence analyzer 912 detects the titles of the respective documents D1,
D2, ..., D7. For example, the title T1 is detected from the document D1, the title T2
from the document D2, the title T3 from the document D3, and so on. Thus, the
titles T1, T2, ..., T7 are detected from the respective documents D1, D2, ..., D7.

The feature element extractor 913 then extracts feature elements from the
respective titles. Thereafter, the feature table generator 914 produces a feature
table representing the relationships between the feature elements and the
documents including the feature elements in their titles. An example of a feature
table is shown in Fig. 12. In this example, the feature table represents the
relationships between feature elements each included in three or more different
documents and the documents including the feature elements. Numerals described
in the feature table represent the numbers of feature elements included in the
respective documents. For example, in the case of feature element "paper", one
feature element is included in the title of each of the documents D1, D4, D6, and D7.

As can be seen from the feature table shown in Fig. 12, documents including
"paper" as a feature element in their title are D1, D4, D6, and D7, documents
including "cassette" as a feature element in their title are D1, D4, and D7, and
documents including "installation" as a feature element in their title are D2, D3, D5,
and D7. In Fig. 10, these feature elements are underlined.
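
As a rough illustration only, a feature table of the kind shown in Fig. 12 might
be built as sketched below. The per-document feature lists reproduce only the
three feature elements named above (the actual titles of Fig. 10 are not
reproduced), and the function name and the `min_docs` parameter are assumptions
introduced for this sketch.

```python
from collections import Counter, defaultdict

def build_feature_table(title_features: dict, min_docs: int = 3) -> dict:
    """Map each feature element to {document: count}, keeping only feature
    elements that appear in the titles of at least `min_docs` documents."""
    table = defaultdict(dict)
    for doc_id, features in title_features.items():
        for feature, count in Counter(features).items():
            table[feature][doc_id] = count
    return {f: docs for f, docs in table.items() if len(docs) >= min_docs}

# Feature elements per title, limited to those named in the description.
title_features = {
    "D1": ["paper", "cassette"],
    "D2": ["installation"],
    "D3": ["installation"],
    "D4": ["paper", "cassette"],
    "D5": ["installation"],
    "D6": ["paper"],
    "D7": ["paper", "cassette", "installation"],
}
feature_table = build_feature_table(title_features)
```

Each entry of `feature_table` corresponds to one row of the table, and each
count corresponds to a numeral in that row.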

The document categorizing unit 915 accesses the feature table and categorizes
the respective documents into clusters for each feature element. The result of the
categorization is shown in Fig. 13. In the categorization into clusters, as described
earlier, the detection of feature elements commonly included in documents may be
performed by detecting synonyms included in documents, using the synonymous
feature dictionary, and documents including detected synonyms may be categorized
into the same document cluster. For example, when "paper" and "printing paper"
are extracted as feature elements, documents including either one of these feature
elements are categorized into the same cluster.
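
The categorization by common feature element, with synonyms treated as the same
feature element, may be sketched as follows. The dictionary contents, the
document identifiers "Da" and "Db", and the function name are hypothetical and
serve only to illustrate the "paper" / "printing paper" case mentioned above.

```python
from collections import defaultdict

# Hypothetical synonymous feature dictionary: maps a feature element to the
# feature element under which it is to be clustered.
SYNONYMS = {"printing paper": "paper"}

def categorize(title_features: dict, synonyms: dict = SYNONYMS) -> dict:
    """Group documents into one cluster per feature element, treating
    synonymous feature elements as a single feature element."""
    clusters = defaultdict(set)
    for doc_id, features in title_features.items():
        for feature in features:
            clusters[synonyms.get(feature, feature)].add(doc_id)
    return dict(clusters)

# Two hypothetical documents whose titles use synonymous feature elements.
example = categorize({"Da": ["paper"], "Db": ["printing paper"]})
```

A document may belong to several clusters at once, which is what later makes
it possible to count documents common to two clusters.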

The obtained categorization result is stored in the categorization result
storage unit 916. In the categorization result shown in Fig. 13, as for a cluster
categorized as "paper" (including documents D1, D4, D6, and D7), as can be
understood from the document contents shown in Fig. 10, a paper cassette is
described in the content of the document D1, setting of paper in the document D4, a
smear created on printed paper in the document D6, and installation of an
additional paper cassette in the document D7.

Thus, the categorization of the documents D1, D4, D6, and D7 including a
description about paper into the cluster can be regarded as proper.

In the case of a cluster categorized as "cassette" (including documents D1, D4,
and D7), as can be understood from the document contents shown in Fig. 10, a
paper cassette is described in the content of the document D1, setting of paper in
the document D4, and installation of an additional paper cassette in the document
D7.

The contents of the documents D1, D4, and D7 include a description about
setting of paper, and thus the categorization of these documents into the cluster can
be regarded as proper.

In the case of a cluster categorized as "installation" (including documents D2,
D3, D5, and D7), as can be understood from the document contents shown in Fig.
10, installation of extension memory is described in the content of the document D2,
installation of an interface card in the document D3, installation of an additional
hard disk in the document D5, and installation of an additional paper cassette in
the document D7.

The contents of the documents D2, D3, D5 and D7 include a description about 
installation of an additional part, and thus the categorization of these documents 
into the cluster can be regarded as proper. 

The reason why this technique allows documents to be properly categorized is
that feature elements are first extracted from the titles of the respective documents,
and then the documents are categorized on the basis of the extracted feature
elements. That is, in most cases, the titles of documents represent, in a simplified
fashion, what is described in the contents of the documents. Therefore, if
categorization is performed using feature elements included in the titles of
documents, scattering into a large number of clusters can be prevented, and the
probability of generation of noise clusters is reduced. Furthermore, because the
titles are created by the authors of the documents so as to concisely represent what
is described in the documents, categorization on the basis of the authors' viewpoints
can be obtained.

After completion of categorization, if a user issues a command to select a
cluster of "paper", documents D1, D4, D6, and D7 categorized in that cluster are
read from the document storage unit 911 and displayed on the display unit 942.
Herein, only the document numbers or document names may be displayed, or
otherwise the contents of the documents may be displayed.

In the present invention, after the clustering described above, the cluster
merging unit 92 performs a cluster merging process.

That is, in the categorization result shown in Fig. 13, the cluster of "paper"
includes documents D1, D4, D6, and D7, and the cluster of "cassette" includes
documents D1, D4, and D7.
Thus, documents D1, D4, and D7 are included in both clusters of "paper" and
"cassette". This means that a feature element of "paper" and a feature element of
"cassette" have a close relation with each other. For example, an expression of
"paper cassette" is included in the title or the main body of the documents D1, D4,
and D7, and thus these documents D1, D4, and D7 can be regarded as having a
close relation. Therefore, it is more desirable that these documents D1, D4, and D7
be categorized into the same cluster.

To the above end, after performing the clustering on the basis of the feature 
elements, the cluster merging process is performed upon the result of the clustering. 

The cluster merging process is described below. First, aside from the
categorization result shown in Fig. 13, a general example is described with
reference to Fig. 14.

We assume here that there are two clusters C1 and C2, wherein the cluster
C1 includes five documents D1, D2, D3, D4, and D8, and the cluster C2 includes six
documents D3, D4, D5, D6, D7, and D8.

Documents which are commonly included in both clusters C1 and C2 are D3,
D4, and D8. In the present embodiment, the degree of relation among a plurality of
clusters is evaluated on the basis of the number of documents which are commonly
included in the plurality of clusters, and clusters are merged depending upon the
evaluated degree of relation.

More specifically, the ratio of the number of documents which are commonly
included in two certain clusters to the total number of documents included in those
two clusters is calculated, and a decision as to whether these two clusters should be
merged is made depending upon whether the calculated ratio is equal to or greater
than a predetermined threshold value.

In this specific example, the total number of documents included in the two
clusters C1 and C2 is eleven, and three documents are commonly included in both
clusters. Thus, the ratio (%) of the number of common documents to the total
number of documents can be calculated, and the decision as to whether merging
should be performed is made in accordance with the calculation result. The
ratio (%) may be determined simply by dividing the number of common documents
by the total number of documents and multiplying the result by 100, or it may be
determined by dividing the product of the number of common documents and a
predetermined arbitrary factor by the total number of documents and then
multiplying the result by 100.

As an example, let us assume that the number of documents included in the
cluster C1 is equal to α1, the number of documents included in the cluster C2 is
equal to α2, and the number of documents which are commonly included in both
clusters C1 and C2 is equal to β. After β is multiplied by a factor of, for example, 2,
the value 2β/(α1 + α2) × 100 is calculated. The result (%) is compared with a
predetermined threshold value TH (%). If the calculated result is equal to or
greater than the threshold value TH, then merging is performed. In the example
shown in Fig. 14, 2β = 2 × 3 = 6 and α1 + α2 = 5 + 6 = 11, and thus the ratio is
calculated as about 55%. If the threshold value TH is set to 70%, the calculated ratio
(55%) is smaller than the threshold value TH (70%), and thus it is determined that
the clusters C1 and C2 should not be merged. The above factor may be set to an
arbitrary value such that the calculated ratio (%) falls within a range which is
proper for comparison with the threshold value. Thus, although the factor is set to
2 in the above example, the factor may be set to 1.
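
The calculation described above may be sketched as follows for the two clusters
of Fig. 14. The function names and the representation of a cluster as a set of
document identifiers are assumptions of this sketch; the factor of 2 and the
threshold of 70% are the example values given above.

```python
def correlation_score(c1: set, c2: set, factor: float = 2.0) -> float:
    """Return factor * (common documents) / (size of c1 + size of c2) * 100."""
    common = len(c1 & c2)        # documents included in both clusters
    total = len(c1) + len(c2)    # total count, common documents counted twice
    if total == 0:
        return 0.0               # guard added for this sketch only
    return factor * common / total * 100.0

def should_merge(c1: set, c2: set, threshold: float = 70.0,
                 factor: float = 2.0) -> bool:
    """Merge when the score is equal to or greater than the threshold."""
    return correlation_score(c1, c2, factor) >= threshold

# The general example of Fig. 14.
C1 = {"D1", "D2", "D3", "D4", "D8"}
C2 = {"D3", "D4", "D5", "D6", "D7", "D8"}
score_fig14 = correlation_score(C1, C2)   # 6 / 11 * 100, i.e. about 55
merge_fig14 = should_merge(C1, C2)        # False with a threshold of 70
```

With a factor of 2, two identical clusters score 100%, which may be one reason
for choosing that factor; the description itself leaves the factor arbitrary.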

Referring back to the categorization result shown in Fig. 13, the cluster of
"paper" includes four documents D1, D4, D6, and D7, and the cluster
of "cassette" includes three documents D1, D4, and D7. Three documents D1, D4,
and D7 are commonly included in both clusters. Now, we calculate the ratio (%) of
the number of common documents to the total number of documents.

The calculation is performed in accordance with the formula described above.
In the case of the categorization result shown in Fig. 13, the total number of
documents (α1 + α2) is calculated as 4 + 3 = 7, the number of common documents is
equal to 3, and thus 2β is calculated as 6. In this case, the ratio becomes as high as
about 86%. Because the calculated ratio is greater than the predetermined
threshold value (70% in this example), it is determined that the cluster of "paper"
and the cluster of "cassette" should be merged into a single cluster.

Similarly, decisions as to whether the cluster of "paper" and the cluster of 
"installation" shown in Fig. 13 should be merged and whether the cluster of 
"cassette" and the cluster of "installation" should be merged are made as follows. 

As for the clusters of "paper" and "installation", the cluster of "paper"
includes four documents D1, D4, D6, and D7, the cluster of "installation" includes
four documents D2, D3, D5, and D7, and only one document D7 is included in both
clusters. Thus, according to the formula described above, the ratio is calculated as
25%, which is lower than the threshold value (70%). Therefore, it is determined
that these clusters should not be merged.

As for the clusters of "cassette" and "installation", the cluster of "cassette"
includes three documents D1, D4, and D7, the cluster of "installation" includes four
documents D2, D3, D5, and D7, and only one document D7 is included in both
clusters. Thus, according to the formula described above, the ratio is calculated as
28%, which is also lower than the threshold value (70%). Therefore, it is
determined that these clusters should not be merged.

As described above, it is determined whether merging should be performed
for each combination of two clusters. The result of categorization (recategorization
by merging) performed upon the categorization result shown in Fig. 13 is shown in
Fig. 15. In Fig. 15, the cluster of "paper" and the cluster of "cassette" are combined
into a single cluster of "paper + cassette" including documents D1, D4, D6, and D7.
On the other hand, the cluster of "installation" remains in the original state without
being combined with another cluster.
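
The pairwise decisions for the three clusters of Fig. 13 may be reproduced with
the short sketch below. The variable and function names are illustrative
assumptions; the cluster memberships, the factor of 2, and the threshold of 70%
are taken from the description.

```python
from itertools import combinations

def score(a: set, b: set, factor: int = 2) -> float:
    """Ratio (%) of common documents to the total number of documents."""
    return factor * len(a & b) / (len(a) + len(b)) * 100

# Categorization result of Fig. 13.
fig13 = {
    "paper": {"D1", "D4", "D6", "D7"},
    "cassette": {"D1", "D4", "D7"},
    "installation": {"D2", "D3", "D5", "D7"},
}
TH = 70  # threshold value (%) used in the example

# Merge decision for every combination of two clusters.
decisions = {
    (x, y): score(fig13[x], fig13[y]) >= TH
    for x, y in combinations(fig13, 2)
}

# Only "paper" and "cassette" qualify; their union is the merged cluster.
merged = fig13["paper"] | fig13["cassette"]
```

Because the merged cluster is the union of the two original clusters, the
documents common to both are counted once in the result.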

Referring to Fig. 15, the recategorization result obtained through the cluster
merging process indicates that in the cluster "paper + cassette" (including
documents D1, D4, D6, and D7), as can be understood from the document contents
shown in Fig. 10, a paper cassette is described in the content of the document D1,
setting of paper in the document D4, a method of handling which should be
performed when printed paper becomes dirty in the document D6, and
installation of an additional paper cassette in the document D7.

The contents of the documents D1, D4, D6 and D7 include a description about
paper or a cassette, and thus the recategorization of these documents into the single
cluster can be regarded as proper. As a matter of fact, merging into the single
cluster of "paper + cassette" results in better categorization.

As described above, a better result can be obtained by first extracting feature
elements from the titles of the respective documents, then performing clustering on
the basis of the extracted feature elements, and finally performing the cluster
merging process for each combination of two clusters of the clusters obtained via the
above clustering process.

After the first cluster merging process for each combination of two clusters,
the result of the recategorization by the cluster merging is obtained as shown in
Fig. 15. Thereafter, a second cluster merging process is performed upon the
recategorization result obtained through the first cluster merging process. That is,
in the result of the first cluster merging process shown in Fig. 15, a cluster merging
process is performed for a combination of the cluster of "paper + cassette" and the
cluster of "installation". In this example, as for the combination of clusters of
"paper + cassette" and "installation", the cluster "paper + cassette" includes four
documents D1, D4, D6, and D7, the cluster of "installation" includes four documents
D2, D3, D5, and D7, and only one document D7 is included in both clusters. The
ratio (%) of the number of common documents to the total number of documents is
calculated as follows. The number of common documents, equal to 1, is first
multiplied by a factor of 2, then divided by the total number of documents, equal
to 8, and further multiplied by 100, and thus the result is obtained as 25%, which is
lower than the threshold value (70%). Thus, it is determined that these clusters
should not be merged.

After completion of the first cluster merging process for each combination of
two clusters, a second cluster merging process is performed for each combination of
two clusters of the clusters obtained via the first cluster merging process. After
completion of the second cluster merging process, a third cluster merging process is
performed for each combination of two clusters of the clusters obtained via the
second cluster merging process. The above process is performed repeatedly until no
further merged cluster is created (that is, until no clusters are merged).
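
The repetition of the cluster merging process until no further merging occurs
may be sketched as follows. This sketch merges one qualifying pair at a time and
then re-evaluates all combinations, which is only one possible way of organizing
the repeated passes and is an assumption of the sketch; the function names and
the "x + y" naming of merged clusters are likewise illustrative.

```python
from itertools import combinations

def score(a: set, b: set, factor: int = 2) -> float:
    """Ratio (%) of common documents to the total number of documents."""
    total = len(a) + len(b)
    return factor * len(a & b) / total * 100 if total else 0.0

def merge_until_stable(clusters: dict, threshold: float = 70.0) -> dict:
    """Repeat pairwise merging until no pair reaches the threshold."""
    clusters = {name: set(docs) for name, docs in clusters.items()}
    merged = True
    while merged:
        merged = False
        for x, y in combinations(list(clusters), 2):
            if score(clusters[x], clusters[y]) >= threshold:
                # Replace the two original clusters by their union.
                clusters[f"{x} + {y}"] = clusters.pop(x) | clusters.pop(y)
                merged = True
                break  # re-evaluate all combinations on the new set
    return clusters

result = merge_until_stable({
    "paper": {"D1", "D4", "D6", "D7"},
    "cassette": {"D1", "D4", "D7"},
    "installation": {"D2", "D3", "D5", "D7"},
})
```

For the clusters of Fig. 13 the loop stops after one merge, since the remaining
pair "paper + cassette" and "installation" scores 25%, below the threshold.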

Although in the above example, the cluster merging process is performed for
a combination of two clusters, the cluster merging process may be performed for a
combination of three or more clusters. In this case, in a first cluster merging
process, cluster merging is performed for each combination of three or more
clusters. Thereafter, cluster merging may be performed repeatedly for the result
obtained via the previous cluster merging process until no further merging occurs.
As in the previous case, the judgment as to whether three or more clusters should
be merged can be performed on the basis of the ratio (%) of the number of common
documents to the total number of documents included in these clusters.

After the cluster merging unit 92 shown in Fig. 9 completes the cluster
merging process, the cluster-merging-process information generator 93 determines
the degrees of relation among the original clusters merged together by the cluster
merging unit 92 and generates cluster-merging-process information on the basis of
the degrees of relation so that the cluster-merging-process information represents
the properties of the original clusters combined together into the final clusters, that
is, so that the cluster-merging-process information indicates which clusters have
been combined together into final clusters and also indicates the degrees of relation
among the original clusters combined together. The process performed by the
cluster-merging-process information generator 93 is described in further detail
below.

In the present embodiment, the relation evaluator 931 evaluates the degrees
of relation among clusters merged together by determining whether the cluster
correlation scores (%) calculated by the cluster merging unit 92 are much greater
than the above-described threshold value TH or close to the threshold value TH.
More specifically, a threshold value TH1 is set to a value (%) higher than the
above-described threshold value TH, and if the cluster correlation score (denoted
by K) calculated by the cluster merging unit 92 is equal to or higher than TH1
(K ≥ TH1), the clusters are determined as having very close relation, that is, as
being very similar to each other. On the other hand, if the cluster correlation
score K calculated by the cluster merging unit 92 is within a range TH1 > K ≥ TH,
the clusters are determined as being similar to each other to a certain extent.
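
The two-level evaluation performed by the relation evaluator 931 may be sketched
as follows. The function name and the returned labels are assumptions made for
illustration; the default values of 70% for the merging threshold and 80% for
the higher threshold are the example values used in this description.

```python
def evaluate_relation(k: float, th: float = 70.0, th1: float = 80.0) -> str:
    """Classify a cluster correlation score K against the two thresholds."""
    if k >= th1:
        return "very close"   # score equal to or higher than the higher threshold
    if k >= th:
        return "moderate"     # merged, but similar only to a certain extent
    return "not merged"       # below the merging threshold

relation_paper_cassette = evaluate_relation(86.0)
```

Scores below the merging threshold never reach the relation evaluator in the
described flow; the third label is included here only to make the function total.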

If K ≥ TH1, that is, if clusters merged into a final cluster have been
determined as having very close relation, the following process is further performed.

In the case of the specific example shown in Fig. 15, a final cluster created via
the cluster merging process has a feature element of "paper + cassette". This
cluster of "paper + cassette" is obtained as a result of merging the cluster of "paper"
and the cluster of "cassette" shown in Fig. 13.

The clusters may be named as follows. For example, a cluster having a
feature element of "paper" is named "paper cluster", and a cluster having a feature
element of "cassette" is named "cassette cluster". Hereinafter, the cluster names
are represented more simply as "paper" and "cassette".

The cluster correlation score of the cluster of "paper + cassette" created by the
cluster merging process has been calculated as 86% by the cluster merging unit 92.
Herein we assume that the threshold value TH1 used by the relation evaluator 931
to evaluate the degree of relation is set to 80%. In this case, the cluster correlation
score K calculated by the cluster merging unit 92 satisfies the condition K ≥ TH1,
and thus the paper cluster and the cassette cluster are determined as having very
close relation and being very similar to each other.

As described above, when a cluster correlation score K calculated by the
cluster merging unit 92 is equal to or greater than TH1 (K ≥ TH1), original clusters
merged together into a final cluster can be regarded as having very close relation
and being very similar to each other. Thus, in such a case, the name of the final
cluster is given by a combination of the original cluster names, which is displayed
by arranging the original cluster names successively. In the case of "paper cluster"
and "cassette cluster", the cluster names "paper" and "cassette" can be combined into
"paper cassette".
That is, in this case, the cluster names are displayed in an AND form. This
method is employed when a simple combination of cluster names does not result in
a problem. In this specific example, the final cluster created via the cluster merging
process is named "paper cassette". The naming of the final cluster as "paper
cassette" can be judged as proper from the contents of the documents (Fig. 10) in the
paper cluster and the cassette cluster merged into the final cluster.

Fig. 16 illustrates an example of information displayed after the above
process. In this specific example shown in Fig. 16, "paper cassette" is displayed as
the cluster name of the final cluster created via the cluster merging process, and the
title names of the documents (D1, D4, D6, D7, shown in Fig. 10) included in this
cluster are displayed.

Instead of displaying the original cluster names successively in a single line
as shown in Fig. 16, the individual original cluster names "paper" and "cassette"
corresponding to the original clusters may be displayed in different adjacent lines as
shown in Fig. 17.

In the case where the original cluster names are displayed in different lines,
unnatural or incongruous linguistic continuity can be avoided. Although in this
specific example, no problems occur when "paper" and "cassette" are combined and
represented in a single line as "paper cassette", combining of cluster names into a
single line can be incongruous depending upon the specific cluster names. For
example, aside from the above example, when a final cluster is created by merging
clusters having names of "product", "usage", and "outline", if the cluster names are
displayed successively in a single line, the result is "product usage outline".
Although this is not entirely unclear in meaning, it is somewhat incongruous in a
linguistic sense. In such a case, language processing may be performed to obtain a
better expression such as "outline of usage of products". However, the language
processing would be complicated and a long processing time would be needed.

In this specific case, the incongruence can be avoided by displaying "product",
"usage", and "outline" in different lines. Another advantage of displaying cluster
names in different lines is that when a large number of clusters are combined
together, displaying of cluster names in different lines prevents the cluster names
from extending too far along a horizontal line.

As described above, when the cluster correlation score K calculated by the
cluster merging unit 92 satisfies the condition K ≥ TH1, the cluster names of
original clusters merged into a final cluster are displayed in the AND form in which
the original cluster names are arranged in a single horizontal line or displayed in
different lines.

This makes it possible for the user to easily understand which clusters have
been merged into a final cluster simply by seeing the cluster name of the final
cluster. For example, in the case of the specific example shown in Fig. 16 or Fig. 17,
it can be easily understood that the final cluster has been created by combining the
original clusters having cluster names "paper" and "cassette" and that the original
clusters have very close relation, that is, the documents included therein have
similar contents.

When TH1 > K ≥ TH, that is, when the degree of relation among original
clusters which have been merged into a final cluster is not very high but some
similar documents are included in the clusters, the process is performed as follows.

When the cluster correlation score K calculated by the cluster merging unit
92 is within the range TH1 > K ≥ TH, the original cluster names are represented in
an OR form.

In the specific example described above, the original cluster names "product",
"usage", and "outline" are represented not in a simple successive fashion but in a
fashion in which a delimiter is placed between adjacent cluster names, such as
"product·usage·outline". If a user is informed in advance that a delimiter placed
between adjacent cluster names represents "OR", the user can understand that the
final cluster obtained via the cluster merging process includes some documents
having contents about "product", "usage", or "outline". When a final cluster name is
represented in the OR form, the delimiter placed between original cluster names is
not limited to a dot as is used in "product·usage·outline" but other types of
delimiters may be used. For example, "/" may be placed between original cluster
names such as "product/usage/outline".
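
The choice between the AND form and the OR form of a final cluster name may be
sketched as follows. The function name, the `relation` labels, and the
`multiline` flag are assumptions introduced for this sketch; the "/" delimiter
and the example names are taken from the description.

```python
def final_cluster_name(names: list, relation: str,
                       delimiter: str = "/", multiline: bool = False) -> str:
    """Build a final cluster name from the original cluster names."""
    if relation == "very close":
        # AND form: names arranged successively in a single line,
        # or displayed in different adjacent lines.
        return "\n".join(names) if multiline else " ".join(names)
    # OR form: a delimiter is placed between adjacent cluster names.
    return delimiter.join(names)

and_name = final_cluster_name(["paper", "cassette"], "very close")
or_name = final_cluster_name(["product", "usage", "outline"], "moderate")
```

Keeping the delimiter as a parameter reflects the statement that the delimiter
is not limited to one particular character.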

In some cases, the cluster correlation score K for some original clusters
included in a final cluster is equal to or greater than TH1 (K ≥ TH1) but the cluster
correlation score K for some other original clusters in the same final cluster is in the
range TH1 > K ≥ TH. In this case, the final cluster name is represented in the form
of a mixture of AND and OR expressions so that the degrees of relation are
indicated in the final cluster name.

In some cases, in a final cluster obtained by merging original clusters, some
original clusters may be included in another original cluster. For example, when
clusters having cluster names "product", "television", "radio", and "video" are merged
together into a final cluster, if the respective clusters "television", "radio" and
"video" are included in the cluster "product" and if the cluster correlation scores K
are within the range TH1 > K ≥ TH, the cluster name of the final cluster is
expressed as "product·(television·radio·video)". The dots in this expression indicate
that "product", "television", "radio", and "video" have a relation of OR. Furthermore,
the brackets enclosing "television", "radio", and "video" indicate that the
clusters "television", "radio", and "video" are included in the cluster "product".
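The bracketed notation for included clusters can be sketched in the same way. The helper below is hypothetical and covers only the OR-form case given in the example, where every included cluster is joined to the enclosing cluster with the same delimiter.

```python
def nested_or_name(parent, children, delimiter="·"):
    """Name a final cluster whose `children` are included in `parent`.

    Sketch for the OR-form case only: the included cluster names are
    enclosed in brackets and placed after the enclosing cluster name.
    """
    inner = delimiter.join(children)
    return f"{parent}{delimiter}({inner})"


print(nested_or_name("product", ["television", "radio", "video"]))
```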

As described above, simply by looking at the cluster names of the final clusters
obtained via the cluster merging process, it is possible to know which clusters have
been combined together into which final clusters and also to know the degrees of
relation among the original clusters combined together.

Note that the present embodiment is not limited to the specific examples
described above, but various modifications are possible without departing from the
spirit of the embodiment. For example, although in the above example, feature
elements to be used to obtain a categorization result such as that shown in Fig. 13
are extracted from the titles of the respective documents, and clustering is
performed on the basis of the feature elements extracted from the titles, the manner
of clustering a plurality of documents is not limited to such a particular method.

For example, instead of clustering documents on the basis of feature elements
extracted from the titles of the documents according to the above embodiment,
clustering may also be performed according to URL addresses (after removing
"http://", the remaining part is employed), updated date/time (without any
restriction or within the last one month), or file sizes (the sizes of the Web pages in
bytes). One of these items or some combination of these items may be employed in
the clustering process. A desired item can be selected, for example, from a menu.

In the case where a selected item is not included in a document, another item may
be employed instead of the selected item. For example, when the title is selected as
the item, if a Web page does not include a title, a URL address may be employed.
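The item selection with fallback can be sketched as follows. The representation of a Web page as a dictionary, the key names "title" and "url", and both function names are assumptions made only for this illustration.

```python
from typing import Optional


def strip_scheme(url: str) -> str:
    """Remove a leading "http://" so that only the remaining part is used."""
    prefix = "http://"
    return url[len(prefix):] if url.startswith(prefix) else url


def clustering_text(page: dict, selected: str = "title",
                    fallback: str = "url") -> Optional[str]:
    """Return the text to cluster on: the selected item, else the fallback."""
    for item in (selected, fallback):
        value = page.get(item)
        if value:  # a missing or empty item falls through to the next one
            return strip_scheme(value) if item == "url" else str(value)
    return None


# A page without a title falls back to its URL address.
print(clustering_text({"title": "", "url": "http://example.com/printer"}))
```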

After performing the clustering using one of the methods, the judgment as to
whether clusters should be merged or not is made by evaluating the similarity
between the clusters under consideration in the manner described above.

For example, consider a case in which clustering is performed
according to URL addresses. We assume herein that documents have been
categorized into a cluster of a certain URL (URL1) and a cluster of another URL
(URL2). We further assume that the cluster of URL1 includes documents D1, D2,
D3, and D4, and the cluster of URL2 includes documents D2, D3, D4, and D5. In
this case, documents which are commonly included in both clusters are D2, D3, and
D4. The ratio of the number of common documents to the total number of
documents is calculated, and it is determined according to the calculated ratio
whether the cluster of URL1 and the cluster of URL2 should be merged or not.
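The ratio for this example can be worked through in a short sketch. It assumes that "the total number of documents" means the distinct documents contained in the two clusters under consideration (their union); the function name and the threshold value 0.5 are hypothetical.

```python
def common_ratio(a: set, b: set) -> float:
    """Ratio of common documents to all distinct documents in both clusters."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


url1 = {"D1", "D2", "D3", "D4"}
url2 = {"D2", "D3", "D4", "D5"}

ratio = common_ratio(url1, url2)   # 3 common documents out of 5 distinct ones
TH = 0.5                           # hypothetical threshold
print(ratio, ratio >= TH)
```

Under this reading the two clusters share three of five documents, so they would be merged for any threshold up to 60%.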

Although in the above embodiment, the judgment as to whether clusters
should be merged or not is made by comparing the ratio (%) of the number of
documents commonly included in clusters under consideration to the total number
of documents to a predetermined threshold value (%), the manner of the judgment
is not limited to that. For example, the judgment as to whether clusters should be
merged or not may be made in accordance with the number of common documents
relative to the numbers of documents included in the respective clusters.
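One possible reading of this alternative is sketched below. The text does not say how the two per-cluster ratios are combined, so requiring both to reach the threshold is an assumption of this sketch, as are the function names and the threshold value 0.7.

```python
def relative_overlap(a: set, b: set) -> tuple:
    """Share of each cluster's documents that also appear in the other."""
    common = len(a & b)
    return (common / len(a) if a else 0.0,
            common / len(b) if b else 0.0)


def should_merge(a: set, b: set, th: float) -> bool:
    # Assumption: the common documents must make up at least `th`
    # of BOTH clusters before the clusters are merged.
    return min(relative_overlap(a, b)) >= th


url1 = {"D1", "D2", "D3", "D4"}
url2 = {"D2", "D3", "D4", "D5"}
print(relative_overlap(url1, url2), should_merge(url1, url2, th=0.7))
```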

When the judgment as to whether to merge clusters is made on the basis of
the number of common documents, the threshold value may be represented as a number of
documents. For example, when the total number of documents is 10, if it is desired to
perform merging when the number of common documents is equal to or greater than 7,
then the threshold value TH is set to 7 and TH1 to 9. In this case, when the
number of common documents is equal to or greater than 9, the cluster name of a
resultant cluster obtained via the cluster merging process is expressed in the AND
form, while when the number of common documents is within the range from 7 to 8,
the cluster name of a resultant cluster is expressed in the OR form. Note that the
above threshold values used herein or in the previous embodiment are given as
mere examples, and they are not limited to those specific examples.
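The count-based thresholds of this example (TH = 7, TH1 = 9) map directly onto a small decision function. The function name and the use of None for "do not merge" are illustrative choices, not part of the embodiment.

```python
from typing import Optional


def merge_form(common_count: int, th: int = 7, th1: int = 9) -> Optional[str]:
    """Decide how a merged cluster is named from the common-document count."""
    if common_count >= th1:
        return "AND"   # 9 or more common documents
    if common_count >= th:
        return "OR"    # 7 or 8 common documents
    return None        # fewer than 7: the clusters are not merged


for n in (6, 7, 8, 9, 10):
    print(n, merge_form(n))
```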

In the above embodiment, different documents D1, D2, ..., D7 are categorized.
The embodiment may also be applied to the case where a single document is divided
into a plurality of contents (into parts having their own themes) and the respective
contents are categorized. Herein, let us assume that contents are given by dividing
a single document at respective titles into plural parts each describing their own
particular themes.

For example, if it is assumed that the documents D1, D2, ..., D7 shown in Fig.
10 are parts of the same single document, these documents D1, D2, ..., D7 can be
regarded as contents in the above-described sense. In this case, the respective
contents include titles T1, T2, ..., T7 and main bodies A1, A2, ..., A7.
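Dividing a single document at its titles can be sketched as follows. How title lines are detected is not covered here; the sketch assumes the lines arrive already flagged as title or body, and the input format and function name are hypothetical.

```python
def split_into_contents(lines):
    """Group (is_title, text) pairs into contents with a title and a main body.

    Assumption: each title starts a new content; body text that appears
    before the first title is ignored in this sketch.
    """
    contents = []
    for is_title, text in lines:
        if is_title:
            contents.append({"title": text, "body": []})
        elif contents:
            contents[-1]["body"].append(text)
    return [{"title": c["title"], "body": " ".join(c["body"])} for c in contents]


document = [
    (True, "Paper Cassette"),
    (False, "The machine is equipped with a universal paper cassette."),
    (True, "Installation of Additional Memory"),
    (False, "The memory of the laser printer can be extended."),
]
for content in split_into_contents(document):
    print(content["title"], "->", content["body"])
```

Each resulting content can then be treated like one of the documents D1, D2, ..., D7 in the clustering and merging steps.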

As described above, the present invention may be applied to the case where a
single document is divided into a plurality of contents, then the contents are
categorized into clusters, and finally similar clusters in the obtained clusters are
merged.

Furthermore, the present embodiment may also be applied to a plurality of
documents obtained via a general retrieval service. In this case, the clustering
process is first performed for a large number of documents obtained via the
retrieval, and the cluster merging process is then performed for the result of the
clustering process. Thereafter, the process described above may be performed upon
the clusters obtained via the cluster merging process so that it can be easily seen
which clusters have been merged together into which other clusters and the degrees of
the relations.
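The merging stage of this flow, repeated until no further merging occurs, can be sketched as below. The sketch assumes a union-based overlap ratio as the degree of relation and merges one pair at a time; the function names, the toy cluster contents, and the threshold 0.5 are all hypothetical.

```python
def common_ratio(a: set, b: set) -> float:
    """Common documents divided by all distinct documents in both clusters."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_until_stable(clusters, th):
    """Repeat pairwise merging until no pair reaches the threshold `th`.

    `clusters` is a list of (names, documents) pairs. The names of merged
    clusters are kept so that the origin of a final cluster stays visible.
    """
    clusters = [(tuple(names), set(docs)) for names, docs in clusters]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                (names_i, docs_i), (names_j, docs_j) = clusters[i], clusters[j]
                if common_ratio(docs_i, docs_j) >= th:
                    clusters[i] = (names_i + names_j, docs_i | docs_j)
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break  # restart the scan on the updated set of clusters
    return clusters


toy = [(["a"], {1, 2, 3, 4}), (["b"], {2, 3, 4, 5}), (["c"], {8, 9})]
print(merge_until_stable(toy, th=0.5))
```

Restarting the scan after each merge is one simple way to make sure the merged cluster is compared again with every remaining cluster.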

A program used to execute the above-described document categorizing
process according to the present embodiment may be stored on a storage medium
such as a floppy disk, an optical disk, or a hard disk. Note that such a storage
medium also falls within the scope of the present invention. The program may also
be obtained via a network.

CLAIMS:

1. A document categorizing method for categorizing a plurality of documents into
a plurality of clusters according to semantic similarity, said method being
characterized in that:

after categorizing said plurality of documents into a plurality of clusters
according to semantic similarity, a cluster merging process is performed such that
relations among clusters of said plurality of clusters are evaluated on the basis of
documents included in the respective clusters, and two or more clusters having a
degree of relation equal to or higher than a predetermined value are combined
together.

2. A document categorizing method according to Claim 1, wherein said cluster
merging process is performed such that the evaluation of relations among clusters
under consideration as to whether they should be merged or not is performed on the
basis of the number of documents commonly included in said clusters under
consideration relative to the total number of documents included in said clusters
under consideration, and cluster merging is performed in accordance with the
evaluation result.

3. A document categorizing method according to Claim 1, wherein said cluster
merging process is performed such that in what manner feature elements, which
characterize respective clusters under consideration as to whether they should be
merged or not, appear in the respective clusters under consideration is examined,
and cluster merging is performed in accordance with the manner in which the
feature elements appear.

4. A document categorizing method according to one of Claims 1 to 3, wherein
said cluster merging process is performed at least for two clusters, and after
completion of the cluster merging process a first time, the cluster merging process is
performed repeatedly for the resultant set of clusters until no further cluster
merging occurs.

5. A document categorizing method according to one of Claims 1 to 4, wherein
after completion of said cluster merging process, supplementary information
indicating that cluster merging has been performed and also indicating the basis on
which the cluster merging has been performed is output.

6. A document categorizing method for categorizing a plurality of documents into
a plurality of clusters according to semantic similarity, said method being
characterized in that:

after categorizing said plurality of documents into a plurality of clusters
according to semantic similarity, a cluster merging process is performed such that
relations among clusters of said plurality of clusters are evaluated on the basis of
documents included in the respective clusters, and two or more clusters having a
degree of relation equal to or higher than a predetermined value are combined
together; and

information representing which clusters have been merged together and also
representing the degrees of relation among the merged clusters is generated and
said information is output together with the categorization result to be presented to
a user so that when final clusters obtained as a result of said cluster merging
process are displayed, the user can see in what manner said cluster merging process
has been performed to obtain said final cluster.

7. A document categorizing method according to Claim 6, wherein said
information output so as to enable the user to see in what manner said cluster
merging process has been performed is given by modifying the manner of displaying
the cluster names of respective clusters merged together in accordance with the
degree of relation among said clusters merged together in such a manner that when
said degree of relation among said clusters is higher than a predetermined value,
said cluster names are displayed in an AND form, however when said degree of
relation among said clusters is lower than the predetermined value, said cluster
names are displayed in an OR form.

8. A document categorizing method according to Claim 7, wherein when said
cluster names are displayed in the AND form, said cluster names of the respective
clusters are displayed successively in a single horizontal line or the respective
cluster names are displayed in different lines, while when said cluster names are
displayed in the OR form, a delimiter is inserted between adjacent cluster names of
the respective clusters.

9. A document categorizing method according to Claim 7 or 8, wherein when a
certain cluster includes a cluster therein, the name of said cluster included in said
certain cluster is enclosed within brackets and placed after the name of said certain
cluster.

10. A document categorizing apparatus for categorizing a plurality of documents
into a plurality of clusters according to semantic similarity, said apparatus
comprising:

a clustering unit for categorizing a plurality of documents into a plurality of
clusters in accordance with semantic similarity; and

a cluster merging unit which evaluates the relation among the plurality of
clusters created by the clustering unit on the basis of the documents included in the
respective clusters and then combines two or more clusters having a degree of
relation equal to or higher than a predetermined value.

11. A document categorizing apparatus for categorizing a plurality of documents
into a plurality of clusters according to semantic similarity, said apparatus
comprising:

a clustering unit for categorizing a plurality of documents into a plurality of
clusters in accordance with semantic similarity;

a cluster merging unit which evaluates the relation among the plurality of
clusters created by the clustering unit on the basis of the documents included in the
respective clusters and then combines two or more clusters having a degree of
relation equal to or higher than a predetermined value;

a cluster-merging-process information generator for generating
cluster-merging-process information representing which clusters have been merged
together and also representing the degrees of relation among the merged clusters
wherein said cluster-merging-process information is to be displayed when final
clusters obtained via said cluster merging process performed by said cluster
merging unit are displayed so that a user can see in what manner said cluster
merging process has been performed to obtain said final cluster; and

categorization result outputting means for outputting said
cluster-merging-process information such that said cluster-merging-process
information is included in the categorization result to be presented to said user.

12. A storage medium on which a document categorizing program for categorizing
a plurality of documents into a plurality of clusters according to semantic similarity
is stored, said document categorizing program comprising:

a clustering step for categorizing a plurality of documents into a plurality of
clusters in accordance with semantic similarity; and

a cluster merging step in which the degrees of relation among clusters of
said plurality of clusters obtained in said clustering step are evaluated on the basis
of documents included in the respective clusters, and two or more clusters having a
degree of relation equal to or higher than a predetermined value are combined
together.

13. A storage medium on which a document categorizing program for categorizing 
a plurality of documents into a plurality of clusters according to semantic similarity 
is stored, said document categorizing program comprising: 

a clustering step for categorizing a plurality of documents into a plurality of 
clusters in accordance with semantic similarity;

a cluster merging step in which the degrees of relation among clusters of 
said plurality of clusters obtained in said clustering step are evaluated on the basis 
of documents included in the respective clusters, and two or more clusters having a 






degree of relation equal to or higher than a predetermined value are combined 
together; 

a cluster-merging-process information generating step for generating
cluster-merging-process information representing which clusters have been merged
together and also representing the degrees of relation among the merged clusters
wherein said cluster-merging-process information is to be displayed when final
clusters obtained via said cluster merging process performed by said cluster
merging unit are displayed so that a user can see in what manner said cluster
merging process has been performed to obtain said final cluster; and

a step for outputting said cluster-merging-process information such that
said cluster-merging-process information is included in the categorization result to
be presented to said user.

ABSTRACT

A document categorizing apparatus includes a sentence analyzer 12 for
analyzing a plurality of documents to detect titles thereof; a feature element
extractor 13 for extracting feature elements from the titles detected by the sentence
analyzer 12 from the respective documents; feature table generating means 14 for
generating a feature table representing the relationships between the feature
elements extracted from the titles and the documents including the feature elements;
a document categorizing unit 15 for categorizing the documents into a plurality of
clusters according to semantic similarity on the basis of the content of the feature
table; a categorization result storage unit 16 for storing the clusters created by the
document categorizing unit 15; a cluster merging unit 2 for performing a cluster
merging process upon the clusters stored in the categorization result storage unit 16;
and an output control unit 31 for outputting the result of the cluster merging
process to a display unit 32.
Paper Cassette

The machine is equipped with a universal paper cassette in the standard configuration. The
universal paper cassette can be replaced with a high-capacity paper cassette dedicated to supplying
A4-size paper. In the standard configuration, 200 sheets of paper can be set on the tray. In addition
to the standard universal paper cassette, ...

Installation of Additional Memory

The memory of the laser printer can be extended by installing an additional memory. The extension
of memory provides improvements such as a reduction in time needed for the operation of the personal
computer. Note that the degree of improvement depends upon the environment in which the laser printer
is used. When image data is printed, installation of a ... Mb additional memory is recommended.

Installation of an Optical Interface Card

An optional interface card can be used to connect the printer directly to a network. In this case, ...

Switching the Paper Setting Among "Tray", "Cassette" and "Auto"

When printing is performed using one of various applications, setting of the paper feeder and the paper
size is necessary. Check that the paper guide clip on the paper cassette is located at a correct position
corresponding to the paper size. When the printer is used in an environment other than a network
environment ...

Installation of an Additional Hard Disk Drive to Install Additional Fonts

Additional fonts can be installed by ... An optional font ROM can also be installed to do ...

When an additional hard disk drive is installed to install a commercially available font ...

Smear Created on Printer Paper

When a black smear is created in an end region of paper or on the back surface of paper, clean the
fixing roller in the printer. Select "Printing for Cleaning" from the setting menu provided on the panel
of the printer and perform printing for cleaning on A4-size cleaning paper. Thereafter, ...

It is recommended to clean the fixing roller periodically in addition to the cleaning performed after
replacing the cartridge. Reproduced paper can be used for the purpose of protecting natural
environments.

Installation of an Additional Paper Cassette

Up to two optional double cassette units or high-capacity paper cassette units dedicated to
supplying A4-size paper can be installed.

When two double cassette units are added, it becomes possible to ...
The A3 universal paper cassette can be replaced with ... to ...
Paper Cassette

The machine is equipped with a universal paper cassette in the standard configuration. The
universal paper cassette can be replaced with a high-capacity paper cassette dedicated to supplying
A4-size paper. In the standard configuration, 200 sheets of paper can be set on the tray.

Installation of Additional Memory

The memory of the laser printer can be extended by installing an additional memory. The extension
of memory provides improvements such as a reduction in time needed for the operation of the personal
computer. Note that the degree of improvement depends upon ...

Installation of an Optical Interface Card

An optional interface card can be used to connect the printer directly to a network. In this case, ...

Switching the Paper Setting Among "Tray", "Cassette" and "Auto"

When printing is performed using one of various applications, setting of the paper feeder and the paper
size is necessary. Check that the paper guide clip on the paper cassette is ...

Installation of an Additional Hard Disk Drive to Install Additional Fonts

Additional fonts can be installed by ... An optional font ROM can also be installed to ...

Smear Created on Printer Paper

When a black smear is created in an end region of paper or on the back surface of paper, clean the
fixing roller in the printer. Select "Printing for Cleaning" from the setting menu provided on the panel
of the printer and perform printing for cleaning on A4-size cleaning paper. Thereafter, ...

Installation of an Additional Paper Cassette

Up to two optional double cassette units or high-capacity paper cassette units dedicated to
supplying A4-size paper can be installed.
PTO/SB/106 (8-96)
Approved for use through 9/30/98. OMB 0651-0032
Patent and Trademark Office; U.S. DEPARTMENT OF COMMERCE
Under the Paperwork Reduction Act of 1995, no persons are required to respond to a collection of information unless it displays a valid OMB control number.

Japanese Language Declaration

I hereby claim foreign priority under Title 35, United States
Code, Section 119 (a)-(d) or 365(b) of any foreign application(s)
for patent or inventor's certificate, or 365(a) of any PCT
International application which designated at least one country
other than the United States, listed below and have also
identified below, by checking the box, any foreign application for
patent or inventor's certificate, or PCT International application
having a filing date before that of the application on which
priority is claimed.

Prior Foreign Application(s)

(Number)        (Country)    (Day/Month/Year Filed)    Priority Not Claimed
11-212501(P)    Japan        27/July/1999              □
11-158498(P)    Japan        4/June/1999               □

I hereby claim the benefit under Title 35, United States Code,
Section 119 (e) of any United States provisional application(s)
listed below.

(Application No.)    (Filing Date)

I hereby claim the benefit under Title 35, United States Code,
Section 120 of any United States application(s), or 365 (c) of
any PCT International application designating the United States,
listed below and, insofar as the subject matter of each of the
claims of this application is not disclosed in the prior United
States or PCT International application in the manner provided
by the first paragraph of Title 35, United States Code, Section
112, I acknowledge the duty to disclose information which is
material to patentability as defined in Title 37, Code of Federal
Regulations, Section 1.56 which became available between the
filing date of the prior application and the national or PCT
International filing date of application:

(Application No.)    (Filing Date)    (Status: Patented, Pending, Abandoned)
PCT/JP00/03625       2/June/2000      Pending

I hereby declare that all statements made herein of my own
knowledge are true and that all statements made on information
and belief are believed to be true; and further that these
statements were made with the knowledge that willful false
statements and the like so made are punishable by fine or
imprisonment, or both, under Section 1001 of Title 18 of the
United States Code and that such willful false statements may
jeopardize the validity of the application or any patent issued
thereon.



POWER OF ATTORNEY: As a named inventor, I hereby appoint
the following attorney(s) and/or agent(s) to prosecute this
application and transact all business in the Patent and Trademark
Office connected therewith. (list name and registration number)

Customer Number 20178
Mark P. Watson (Reg. 31,448)
Michael T. Gabrik (Reg. 32,896)
Rosalio Haro (Reg. 42,633)

Send Correspondence to:
Customer Number 20178
EPSON Research and Development, Inc.
150 River Oaks Parkway, Suite 225
San Jose, CA 95134
Attn: Intellectual Property Department

Direct Telephone Calls to: (name and telephone number)
Mark P. Watson (408) 952-6124 or
Michael T. Gabrik (408) 952-6126 or
Rosalio Haro (408) 952-6131



Full name of sole or first inventor
Shinji MIWA

Inventor's signature

Date

Residence

Citizenship
Japan

Post Office Address
c/o Seiko Epson Corporation
3-5, Owa 3-chome, Suwa-shi, Nagano-ken 392-8502 Japan



Full name of second joint inventor, if any 
_Jiichlhiro NA^^ 



Second Inventor's signature 



Citizenship
Japan

Post Office Address
c/o Seiko Epson Corporation
3-5, Owa 3-chome, Suwa-shi, Nagano-ken 392-8502 Japan

(Supply similar information and signature for third and subsequent
joint inventors.)



Page 3 of 3 



